Implementing a popularity algorithm in Django


I am creating a site similar to reddit and hacker news that has a database of links and votes. I am implementing hacker news' popularity algorithm and things are going pretty swimmingly until it comes to actually gathering up these links and displaying them. The algorithm is simple:
Y Combinator's Hacker News:
Popularity = (p - 1) / (t + 2)^1.5
Votes divided by age factor.
Where
p : votes (points) from users.
t : time since submission in hours.
1 is subtracted from p to negate the submitter's vote.
Age factor is (time since submission in hours plus two) to the power of 1.5.
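For reference, the formula above can be written as a small plain-Python function. This is only a sketch: the function name popularity and the gravity parameter name are made up for illustration, and nothing here touches Django.

```python
def popularity(points, age_hours, gravity=1.5):
    """Hacker News style score: (p - 1) / (t + 2) ** gravity."""
    # Subtract 1 to discount the submitter's own vote.
    return (points - 1) / (age_hours + 2) ** gravity


# A link with 10 points submitted 2 hours ago: 9 / 4 ** 1.5
example_score = popularity(10, 2)
```

The "+ 2" keeps the denominator away from zero for brand-new links, and the exponent controls how quickly older links sink.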
I asked a very similar question over yonder (Complex ordering in Django), but instead of contemplating my options I chose one and tried to make it work, because that's how I did it with PHP/MySQL. I now know Django does things a lot differently.
My models look something (exactly) like this
class Link(models.Model):
    category = models.ForeignKey(Category)
    user = models.ForeignKey(User)
    created = models.DateTimeField(auto_now_add = True)
    modified = models.DateTimeField(auto_now = True)
    fame = models.PositiveIntegerField(default = 1)
    title = models.CharField(max_length = 256)
    url = models.URLField(max_length = 2048)

    def __unicode__(self):
        return self.title

class Vote(models.Model):
    link = models.ForeignKey(Link)
    user = models.ForeignKey(User)
    created = models.DateTimeField(auto_now_add = True)
    modified = models.DateTimeField(auto_now = True)
    karma_delta = models.SmallIntegerField()

    def __unicode__(self):
        return str(self.karma_delta)
and my view:
def index(request):
    popular_links = Link.objects.select_related().annotate(karma_total = Sum('vote__karma_delta'))
    return render_to_response('links/index.html', {'links': popular_links})
Now from my previous question, I am trying to implement the algorithm using the sorting function. An answer from that question seems to think I should put the algorithm in the select and sort then. I am going to paginate these results so I don't think I can do the sorting in python without grabbing everything. Any suggestions on how I could efficiently do this?
EDIT
This isn't working yet but I think it's a step in the right direction:
from django.shortcuts import render_to_response
from linkett.apps.links.models import *

def index(request):
    popular_links = Link.objects.select_related()
    popular_links = popular_links.extra(
        select = {
            'karma_total': 'SUM(vote.karma_delta)',
            'popularity': '(karma_total - 1) / POW(2, 1.5)',
        },
        order_by = ['-popularity']
    )
    return render_to_response('links/index.html', {'links': popular_links})
This errors out into:
Caught an exception while rendering: column "karma_total" does not exist
LINE 1: SELECT ((karma_total - 1) / POW(2, 1.5)) AS "popularity", (S...
EDIT 2
Better error?
TemplateSyntaxError: Caught an exception while rendering: missing FROM-clause entry for table "vote"
LINE 1: SELECT ((vote.karma_total - 1) / POW(2, 1.5)) AS "popularity...
My index.html is simply:
{% block content %}
{% for link in links %}
karma-up
{{ link.karma_total }}
karma-down
{{ link.title }}
Posted by {{ link.user }} to {{ link.category }} at {{ link.created }}
{% empty %}
No Links
{% endfor %}
{% endblock content %}
EDIT 3
So very close! Again, all these answers are great but I am concentrating on a particular one because I feel it works best for my situation.
from django.db.models import Sum
from django.shortcuts import render_to_response
from linkett.apps.links.models import *

def index(request):
    popular_links = Link.objects.select_related().extra(
        select = {
            'popularity': '(SUM(links_vote.karma_delta) - 1) / POW(2, 1.5)',
        },
        tables = ['links_link', 'links_vote'],
        order_by = ['-popularity'],
    )
    return render_to_response('links/test.html', {'links': popular_links})
Running this I am presented with an error hating on my lack of group by values. Specifically:
TemplateSyntaxError at /
Caught an exception while rendering: column "links_link.id" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: ...karma_delta) - 1) / POW(2, 1.5)) AS "popularity", "links_lin...
Not sure why my links_link.id wouldn't be in my group by but I am not sure how to alter my group by, django usually does that.

On Hacker News, only the 210 newest stories and 210 most popular stories are paginated (7 pages worth * 30 stories each). My guess is that the reason for the limit (at least in part) is this problem.
Why not drop all the fancy SQL for the most popular stories and just keep a running list instead? Once you've established a list of the top 210 stories you only need to worry about reordering when a new vote comes in since relative order is maintained over time. And when a new vote does come in, you only need to worry about reordering the story that received the vote.
If the story that received the vote is not on the list, calculate the score of that story, plus the least popular story that is on the list. If the story that received the vote is lower, you're done. If it's higher, calculate the current score for the second-to-least most popular (story 209) and compare again. Continue working up until you find a story with a higher score and then place the newly-voted-upon story right below that one in the rankings. Unless, of course, it reaches #1.
The benefit of this approach is that it limits the set of stories you have to look at to figure out the top stories list. In the absolute worst case scenario, you have to calculate the ranking for 211 stories. So it's very efficient unless you have to establish the list from an existing data set - but that's just a one-time penalty assuming you cache the list someplace.
Downvotes are another issue, but I can only upvote (at my karma level, anyway).
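The bubbling-up step described above could look roughly like this in plain Python. It is a sketch only: bubble_up, ranked, score and limit are names invented here, score stands for whatever function returns a story's current popularity, and it takes the answer's premise (only the story that just received a vote needs to move) as given.

```python
def bubble_up(ranked, story, score, limit=210):
    """Re-place `story` in `ranked` (best first) after it received a vote."""
    if story in ranked:
        pos = ranked.index(story)
        ranked.pop(pos)
    else:
        # A story that is not on the list starts just below the last entry.
        pos = len(ranked)
    # Walk upward past every story whose current score is lower.
    while pos > 0 and score(ranked[pos - 1]) < score(story):
        pos -= 1
    ranked.insert(pos, story)
    # Drop anything pushed off the end of the list.
    del ranked[limit:]
    return ranked


scores = {"a": 3.0, "b": 2.0, "c": 1.0, "d": 2.5}
top = ["a", "b", "c"]
bubble_up(top, "d", scores.get, limit=3)
```

In the worst case the loop scores every story on the list once, which matches the "at most 211 stories" estimate for a 210-entry list.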

popular_links = Link.objects.select_related()
popular_links = popular_links.extra(
    select = {
        'karma_total': 'SUM(vote.karma_delta)',
        'popularity': '(karma_total - 1) / POW(2, 1.5)'
    },
    order_by = ['-popularity']
)
Or select some sane number, sort the selection using Python in any way you like, and cache it if it's going to be static for all users (which it looks like it will be); set the cache expiration to a minute or so.
But the extra() will work better for paginated results in a highly dynamic setup.

Seems like you could overload the save of the Vote class and have it update the corresponding Link object. Something like this should work well:
from datetime import datetime, timedelta
from django.db import models

class Link(models.Model):
    category = models.ForeignKey(Category)
    user = models.ForeignKey(User)
    created = models.DateTimeField(auto_now_add = True)
    modified = models.DateTimeField(auto_now = True)
    fame = models.PositiveIntegerField(default = 1)
    title = models.CharField(max_length = 256)
    url = models.URLField(max_length = 2048)
    # a field to keep the most recently calculated popularity
    popularity = models.FloatField(default = 0)

    def CalculatePopularity(self):
        """
        Add a shortcut to make life easier ... this is used by the overloaded save() method and
        can be used in a management function to do a mass-update periodically
        """
        # created is only filled in on the first save, so treat a new link as zero hours old
        ts = datetime.now() - self.created if self.created else timedelta(0)
        # total_seconds() includes whole days; .seconds alone wraps around every 24 hours
        th = ts.total_seconds() / 3600.0
        # vote_set is the default reverse accessor for Vote.link; unsaved links have no votes
        votes = self.vote_set.count() if self.pk else 0
        self.popularity = (votes - 1) / ((th + 2) ** 1.5)

    def save(self, *args, **kwargs):
        """
        Modify the save function to calculate the popularity
        """
        self.CalculatePopularity()
        super(Link, self).save(*args, **kwargs)

    def __unicode__(self):
        return self.title

class Vote(models.Model):
    link = models.ForeignKey(Link)
    user = models.ForeignKey(User)
    created = models.DateTimeField(auto_now_add = True)
    modified = models.DateTimeField(auto_now = True)
    karma_delta = models.SmallIntegerField()

    def save(self, *args, **kwargs):
        """
        Modify the save function to calculate the popularity of the Link object
        """
        # save the vote first so it is counted, then let Link.save() recalculate and store
        super(Vote, self).save(*args, **kwargs)
        self.link.save()

    def __unicode__(self):
        return str(self.karma_delta)
This way every time you call link_o.save() or vote_o.save() it will recalculate the popularity. You have to be a little careful, because Link.objects.all().update(...) does not call our overloaded save() function. The score also decays with age, so when I use this sort of thing I create a management command which updates all of the objects so they're not too out of date. Something like this will work:
for link in Link.objects.all().iterator():
    link.save()
Using iterator() means the queryset results are not cached in memory ... so if you have a giant database it won't cause a memory error.
Now to do your ranking all you have to do is:
Link.objects.all().order_by('-popularity')
It will be super-fast since all of your Link items have already calculated the popularity.
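One detail worth checking when turning a timedelta into hours for this formula: the .seconds attribute only holds the leftover part of the last day, while total_seconds() covers the whole interval. A small standard-library illustration (the variable names are just for this example):

```python
from datetime import timedelta

age = timedelta(days=2, hours=3)

# .seconds ignores the two whole days and only sees the 3 leftover hours
hours_from_seconds = age.seconds / 3600

# total_seconds() counts the full 2 days + 3 hours
hours_from_total = age.total_seconds() / 3600
```

With .seconds, a link would appear to become "new" again every 24 hours, so total_seconds() is the safer choice for the age factor.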

Here was the final answer to my question although many months late and not exactly what I had in mind. Hopefully it will be useful to some.
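from datetime import datetime
from django.db.models import Count

def hot(request):
    # newest 150 links, each annotated with its vote count
    links = Link.objects.select_related().annotate(votes=Count('vote')).order_by('-created')[:150]
    now = datetime.now()
    for link in links:
        # total_seconds() is portable; strftime("%s") only works on some platforms
        delta_in_hours = int((now - link.created).total_seconds()) // 3600
        link.popularity = (link.votes - 1) / (delta_in_hours + 2) ** 1.5
    # popularity is not a database field, so sort the fetched objects in Python
    links = sorted(links, key=lambda x: x.popularity, reverse=True)
    links = paginate(request, links, 5)
    return direct_to_template(
        request,
        template = 'links/link_list.html',
        extra_context = {
            'links': links
        })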
What's going on here is I pull the latest 150 submissions (5 pages of 30 links each); if you need more, you can grab them by altering my slice [:150]. This way I don't have to iterate over my whole queryset, which might eventually become very large, and really 150 links should be enough procrastination for anybody.
I then calculate the difference in time between now and when the link was created and turn it into hours (not nearly as easy as I thought).
I apply the algorithm to a non-existent field. I like this method because I don't have to store the value in my database and it isn't reliant on surrounding links.
The line immediately after the for loop was where I also had another bit of trouble. I can't order_by('popularity') because it's not a real field in my database and is calculated on the fly, so I have to convert my queryset into an object list and sort by popularity from there.
The next line is just my paginator shortcut; thankfully pagination does not require a queryset, unlike some generic views (talking to you, object_list).
Then I spit everything out into a nice direct_to_template generic view and am on my merry way.
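The same fetch, score, sort and paginate flow can be tried without a database. The sketch below uses plain tuples instead of Link objects; rank_links, the tuple layout and per_page are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def rank_links(links, now, per_page=30):
    """links: iterable of (title, votes, created). Returns a list of pages."""
    def score(link):
        title, votes, created = link
        age_hours = (now - created).total_seconds() / 3600
        return (votes - 1) / (age_hours + 2) ** 1.5

    ordered = sorted(links, key=score, reverse=True)
    # Slice the sorted list into fixed-size pages.
    return [ordered[i:i + per_page] for i in range(0, len(ordered), per_page)]


now = datetime(2020, 1, 1, 12, 0)
sample = [
    ("old", 50, now - timedelta(hours=48)),
    ("new", 5, now - timedelta(hours=1)),
    ("mid", 10, now - timedelta(hours=2)),
]
pages = rank_links(sample, now, per_page=2)
```

Because the sort happens in Python, only the links that were fetched can ever appear in the ranking, which is the trade-off of capping the query at a fixed number of rows.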

Related

Django: Transactions and how to avoid wrong counting?

I am currently struggling with a topic connected to transactions. I implemented a discount functionality. Whenever a sale is made with a discount code, the counter redeemed_quantity is increased by 1.
Now I thought about the following case: what if two or more users redeem a discount at the same time? Assume redeemed_quantity is 10. User 1 buys the product and redeemed_quantity increases by 1 to 11. User 2 clicks on 'Pay' at the same time and redeemed_quantity again increases from 10 to 11, even though it should be 12. I learned about @transaction.atomic, but I think the way I implemented it here will not help me with what I am actually trying to prevent. Can anyone help me with that?
view.py
class IndexView(TemplateView):
    template_name = 'website/index.html'
    initial_price_of_course = 100000  # TODO: Move to settings

    def check_discount_and_get_price(self):
        discount_code_get = self.request.GET.get('discount')
        discount_code = Discount.objects.filter(code=discount_code_get).first()
        if discount_code:
            discount_available = discount_code.available()
            if not discount_available:
                messages.add_message(
                    self.request,
                    messages.WARNING,
                    'Discount not available anymore.'
                )
        if discount_code and discount_available:
            return discount_code, self.initial_price_of_course - discount_code.value
        else:
            return discount_code, self.initial_price_of_course

    def get_context_data(self, **kwargs):
        context = super().get_context_data(**kwargs)
        context['stripe_pub_key'] = settings.STRIPE_PUB_KEY
        discount_object, course_price = self.check_discount_and_get_price()
        context['course_price'] = course_price
        return context

    @transaction.atomic
    def post(self, request, *args, **kwargs):
        stripe.api_key = settings.STRIPE_SECRET_KEY
        token = request.POST.get('stripeToken')
        email = request.POST.get('stripeEmail')
        discount_object, course_price = self.check_discount_and_get_price()
        charge = stripe.Charge.create(
            amount=course_price,
            currency='EUR',
            description='My Description',
            source=token,
            receipt_email=email,
        )
        if charge.paid:
            if discount_object:
                discount_object.redeemed_quantity += 1
                discount_object.save()
            order = Order(
                total_gross=course_price,
                discount=discount_object
            )
            order.save()
        return redirect('website:index')
models.py
class Discount(TimeStampedModel):
    code = models.CharField(max_length=20)
    value = models.IntegerField()  # Smallest currency unit, as amount charged
    max_quantity = models.IntegerField()
    redeemed_quantity = models.IntegerField(default=0)

    def available(self):
        available_quantity = self.max_quantity - self.redeemed_quantity
        if available_quantity > 0:
            return True

class Order(TimeStampedModel):
    total_gross = models.IntegerField()
    discount = models.ForeignKey(
        Discount,
        on_delete=models.PROTECT,  # Can't delete discount if used.
        related_name='orders',
        null=True,

You can pass the handling of the incrementation to the database in order to avoid the race condition in your code by using django's F expression:
from django.db.models import F
# ...
discount_object.redeemed_quantity = F('redeemed_quantity') + 1
discount_object.save()
From the docs with a completely analogous example:
Although reporter.stories_filed = F('stories_filed') + 1 looks like a normal Python assignment of value to an instance attribute, in fact it’s an SQL construct describing an operation on the database.
When Django encounters an instance of F(), it overrides the standard Python operators to create an encapsulated SQL expression; in this case, one which instructs the database to increment the database field represented by reporter.stories_filed.
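To see why pushing the increment into the database helps, here is roughly the kind of UPDATE statement such an expression is meant to produce, run against an in-memory SQLite database with the standard sqlite3 module. This is a stand-in sketch, not Django code: the table layout, the redeem helper and the 'SPRING' code are made up, and the extra redeemed_quantity < max_quantity condition is an addition for illustration that is not part of the answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE discount ("
    "code TEXT PRIMARY KEY, max_quantity INTEGER, redeemed_quantity INTEGER)"
)
conn.execute("INSERT INTO discount VALUES ('SPRING', 12, 10)")
conn.commit()


def redeem(conn, code):
    """Increment the counter inside the database; True if a row was updated."""
    cur = conn.execute(
        "UPDATE discount SET redeemed_quantity = redeemed_quantity + 1 "
        "WHERE code = ? AND redeemed_quantity < max_quantity",
        (code,),
    )
    conn.commit()
    return cur.rowcount == 1


results = [redeem(conn, "SPRING") for _ in range(3)]
final = conn.execute("SELECT redeemed_quantity FROM discount").fetchone()[0]
```

The point is that the new value is computed from whatever is in the row at the moment of the UPDATE, not from a number that was read into Python earlier.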

Django is synchronous code. That means every request you make to the server is processed individually. This problem could arise when there are multiple server workers (for example uwsgi workers), but again, it is very unlikely in practice. We run a webshop application with multiple workers and something like this has never happened.
But back to the question - if you want to query the database to increase a value by one, see schwobaseggl's answer.
The last thing is that I think you misunderstand what transaction.atomic() does. Simply put, if the function exits with an error, it rolls back any queries the function made to the database, restoring the state from when the function was called. See this answer and this piece of documentation. Maybe it will clear some things up.
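The rollback behaviour can be illustrated with the standard sqlite3 module, whose connection context manager commits on success and rolls back when an exception escapes the block. This is only an analogy for what transaction.atomic() does, not Django itself; the orders table and the simulated failure are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total INTEGER)")
conn.commit()

try:
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("INSERT INTO orders (total) VALUES (100)")
        raise RuntimeError("payment failed")
except RuntimeError:
    pass
count_after_failure = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]

with conn:
    conn.execute("INSERT INTO orders (total) VALUES (100)")
count_after_success = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

Note that this guarantees all-or-nothing writes; on its own it does not stop two requests from reading the same old counter value.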

django in one queryset all and filter methods

I am trying to display, in a view of my Django app, the last 5 items and also those items which have is_home set to True.
Please tell me whether this is a 'nice' and correct way:
My model:
class Event(models.Model):
    title = models.CharField(max_length=500)
    date = models.DateField()
    is_home = models.BooleanField(default=False)
My query in view:
context['event_list'] = Event.objects.filter(Q(Event.objects.all()) | Event.objects.filter(is_home=True))[:5]

context['event_list'] = Event.objects.filter(is_home=True).order_by('-id')[:5]

Simply use:
list(Event.objects.all().order_by('-id')[:5]) + list(Event.objects.filter(is_home=True))
Unfortunately, you cannot (as far as I can tell) combine queries after taking a slice, so conversion to lists is necessary.
If you really really want to have a QuerySet you can do:
Event.objects.filter(Q(id__in=Event.objects.all().order_by('-id')[:5].values_list('id', flat=True)) | Q(is_home=True))
Which is extremely ugly.
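One thing to watch with the list concatenation: an event that is both among the last five and has is_home set will show up twice. A plain-Python sketch of merging while dropping repeats, using (id, title) tuples as stand-ins for Event objects (merge_unique is a made-up helper name):

```python
def merge_unique(*groups):
    """Concatenate groups of (id, title) pairs, keeping the first occurrence of each id."""
    seen = set()
    merged = []
    for group in groups:
        for event_id, title in group:
            if event_id not in seen:
                seen.add(event_id)
                merged.append((event_id, title))
    return merged


latest = [(9, "Concert"), (8, "Fair")]
home = [(8, "Fair"), (2, "Opening")]
events = merge_unique(latest, home)
```

With real model instances the same idea works by keying on the object's pk.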

Django - Using signals to make changes to other models

say I have two models like so...
class Product(models.Model):
    ...
    overall_rating = models.IntegerField()
    ...

class Review(models.Model):
    ...
    product = models.ForeignKey(Product, related_name='review', on_delete=models.CASCADE)
    rating = models.IntegerField()
    ...
I want to use the ratings from all of the child Review objects to build an average overall_rating for the parent Product.
Question: I'm wondering how I may be able to achieve something like this using Django signals?
I am a bit of a newbie to this part of Django; have never really understood the signals part before.
This overall_rating value needs to be stored in the database instead of calculated using a method since I plan on ordering the Product objects based on their overall_rating which is done on a DB level. The method may look something like this if I were to implement it (just for reference):
def overall_rating(self):
    review_count=self.review.count()
    if review_count >= 1:
        ratings=self.review.all().values_list('rating',flat=True)
        rating_sum = 0
        for i in ratings:
            rating_sum += int(i)
        return rating_sum / review_count
    else:
        return 0
Thank you

You want to update your Product after each save of a Review. So the best and fastest way would be to use the post_save signal. For example, after each saved review you can get all reviews of its product, calculate the overall rating and then save it to the Product.
from django.db.models.signals import post_save
from django.dispatch import receiver

@receiver(post_save, sender=Review, dispatch_uid="update_overall_rating")
def update_rating(sender, instance, **kwargs):
    parent = instance.product
    all_reviews = Review.objects.filter(product=parent)
    parent.overall_rating = get_overall_rating(all_reviews)
    # persist the new value, otherwise it is lost when the handler returns
    parent.save()
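The get_overall_rating helper is not defined in this answer. One possible shape for it, assuming it receives plain rating numbers (for example from values_list('rating', flat=True)) and that rounding to a whole number is acceptable for the IntegerField, might be:

```python
def get_overall_rating(ratings):
    """Average a collection of integer ratings; 0 when there are none."""
    ratings = list(ratings)
    if not ratings:
        return 0
    # overall_rating is stored as an integer, so round the mean
    return round(sum(ratings) / len(ratings))


average = get_overall_rating([5, 4, 3])
empty = get_overall_rating([])
```

If fractional averages matter for ordering, a FloatField or DecimalField for overall_rating would avoid the rounding step.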

Django filter against ForeignKey and by result of manytomany sub query

I've looked at doing a query using an extra and/or annotate but have not been able to get the result I want.
I want to get a list of Products which have active licenses, and also the total number of available licenses. An active license is defined as being not obsolete, in date, and having more licenses than assigned licenses (the number of assigned licenses is a count on the manytomany field).
The models I have defined are:
class Vendor(models.Model):
    name = models.CharField(max_length=200)
    url = models.URLField(blank=True)

class Product(models.Model):
    name = models.CharField(max_length=200)
    vendor = models.ForeignKey(Vendor)
    product_url = models.URLField(blank=True)
    is_obsolete = models.BooleanField(default=False, help_text="Is this product obsolete?")

class License(models.Model):
    product = models.ForeignKey(Product)
    num_licenses = models.IntegerField(default=1, help_text="The number of assignable licenses.")
    licensee_name = models.CharField(max_length=200, blank=True)
    license_key = models.TextField(blank=True)
    # pass the callable so the default is evaluated per row, not once at import time
    license_startdate = models.DateField(default=date.today)
    license_enddate = models.DateField(null=True, blank=True)
    is_obsolete = models.BooleanField(default=False, help_text="Is this license obsolete?")
    licensees = models.ManyToManyField(User, blank=True)
I have tried filtering by the License model, which works, but I don't know how to then collate / GROUP BY / aggregate the returned data into a single queryset that is returned.
When trying to filter by product, I can't quite figure out the query I need to do. I can get bits and pieces, and have tried using a .extra() select= query to return the number of available licenses (which is all I really need at this point), of which there will be multiple licenses associated with a product.
So, the ultimate answer I am after is: how can I retrieve a list of available products with the number of available licenses in Django? I'd rather not resort to using raw SQL if I can avoid it.
An example queryset that gets all the License details I want, I just can't get the product:
License.objects.annotate(
used_licenses=Count('licensees')
).extra(
select={
'avail_licenses': 'licenses_license.num_licenses - (SELECT count(*) FROM licenses_license_licensees WHERE licenses_license_licensees.license_id = licenses_license.id)'
}
).filter(
is_obsolete=False,
num_licenses__gt=F('used_licenses')
).exclude(
license_enddate__lte=date.today()
)
Thank you in advance.
EDIT (2014-02-11):
I think I've solved it in possibly an ugly way. I didn't want to make too many DB calls if I can, so I get all the information using a License query, then filter it in Python and return it all from inside a manager class. Maybe an overuse of Dict and list. Anyway, it works, and I can expand it with additional info later on without a huge amount of risk or custom SQL. And it also uses some of the models parameters that I have defined in the model class.
class LicenseManager(models.Manager):
    def get_available_products(self):
        licenses = self.get_queryset().annotate(
            used_licenses=Count('licensees')
        ).extra(
            select={
                'avail_licenses': 'licenses_license.num_licenses - (SELECT count(*) FROM licenses_license_licensees WHERE licenses_license_licensees.license_id = licenses_license.id)'
            }
        ).filter(
            is_obsolete=False,
            num_licenses__gt=F('used_licenses')
        ).exclude(
            license_enddate__lte=date.today()
        ).prefetch_related('product')
        products = {}
        for lic in licenses:
            if lic.product not in products:
                products[lic.product] = lic.product
                products[lic.product].avail_licenses = lic.avail_licenses
            else:
                products[lic.product].avail_licenses += lic.avail_licenses
        avail_products = []
        for prod in products.values():
            if prod.avail_licenses > 0:
                avail_products.append(prod)
        return avail_products
EDIT (2014-02-12):
Okay, this is the final solution I have decided to go with. Uses Python to filter the results. Reduces cache calls, and has a constant number of SQL queries.
The lesson here is that for something with many levels of filtering, it's best to get as much as needed, and filter in Python when returned.
class ProductManager(models.Manager):
    def get_all_available(self, curruser):
        """
        Gets all available Products that are available to the current user
        """
        q = self.get_queryset().select_related().prefetch_related('license', 'license__licensees').filter(
            is_obsolete=False,
            license__is_obsolete=False
        ).exclude(
            license__enddate__lte=date.today()
        ).distinct()
        # return a curated list. Need further information first
        products = []
        for x in q:
            x.avail_licenses = 0
            x.user_assigned = False
            # checks licenses. Does this on the model level as it's cached so as to save SQL queries
            for y in x.license.all():
                if not y.is_active:
                    break
                x.avail_licenses += y.available_licenses
                if curruser in y.licensees.all():
                    x.user_assigned = True
            products.append(x)
        return q
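The per-product totalling that both manager versions do in Python can be tried in isolation. The sketch below uses (product, num_licenses, assigned) tuples as stand-ins for License rows; available_by_product and the tuple layout are assumptions for the example, not part of the models above.

```python
from collections import defaultdict


def available_by_product(licenses):
    """Sum free licenses per product and keep only products with some left."""
    totals = defaultdict(int)
    for product, num_licenses, assigned in licenses:
        totals[product] += num_licenses - assigned
    return {product: free for product, free in totals.items() if free > 0}


rows = [("Editor", 5, 2), ("Editor", 3, 3), ("IDE", 1, 1)]
available = available_by_product(rows)
```

This mirrors the idea of fetching the license rows once and doing the grouping in application code.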

One strategy would be to get all the product ids from your License queryset:
productIDList = list(License.objects.filter(...).values_list(
'product_id', flat=True))
and then query the products using that list of ids:
Product.objects.filter(id__in=productIDList)

Django-date incrementation in a list with ManyToManyField

New to django/programming, any help is greatly appreciated. I need help moving through a history of doctor appointments and selecting what immunizations were performed at each appointment, then creating a date when the immunization is due in the future (based on an immunization information table, which has the proper interval of immunizations and will increment from the visit date)
models.py
class Immunizations(models.Model):
    immunization = models.CharField(max_length=100, null=True)
    interval = models.CharField(max_length=5, null=True)  # This should probably be an integer field, will change later

class Visit(models.Model):
    patient = models.ForeignKey(Patients)
    date_of_visit = models.DateField(null=True)
    weight = models.CharField(max_length=5, null=True)
    immunization = models.ManyToManyField(Immunizations)
    timestamp = models.DateTimeField(auto_now_add=True, default=datetime.datetime.now())
    active = models.BooleanField(default=True)
I have been reading the documentation and questions on SO all weekend, but am still very conflicted about what way to go through this.
What I want is:
Visit.date_of_visit1
Visit.immunization1, Visit.date_of_visit1 + Immunization.interval1
Visit.immunization2, Visit.date_of_visit1 + Immunization.interval2
Visit.date_of_visit2
Visit.immunization1, Visit.date_of_visit2 + Immunization.interval1
ETC
This could go on for years with each visit having different immunizations performed. I want to maintain a record of which immunization was performed and record the due date, even if that due date has passed.
views.py
def visit_profile(request, slug):
    patient = Patients.objects.get(slug=slug)
    try:
        visit = Visit.objects.filter(patient_id=patient.id)
    except:
        return HttpResponseRedirect('/')
    #Immunization Due Dates
    visitdate = Visit.objects.get(patient_id=patient.id, active=1).date_of_visit
    imm = Immunizations.objects.all()
    visitimm = []
    for immunization in imm:
        due = Immunizations.objects.get(id= immunization.pk)
        duedate = visitdate + timedelta(days=int(due.interval))
        visitimm.append((due, duedate))
    return render_to_response('patient.html',locals(), context_instance=RequestContext(request))
Need help with my views.py. The above works, but only at showing the active=1 visit information. I can't figure out how to modify/re-do to achieve what I want and be able to access the data in my template file. I've experimented with __in method, itertools, looping, etc. Can anyone provide the proper method/direction for doing this? I will go back and properly setup error catching once I can get the code to work. Thanks!

Yep, make interval an IntegerField, or maybe rather a PositiveSmallIntegerField, since it will never hold a negative value or a very large number.
Careful: it's better not to mix plural and singular in model names. They affect the related names when you traverse your foreign keys, which makes debugging a pain, see here. I prefer to use only singulars.
Instead of:
visit = Visit.objects.filter(patient_id=patient.id)
You can simply type:
visit = Visit.objects.filter(patient=patient)
Try something like this
def visit_profile(request, slug):
    patient = Patients.objects.get(slug=slug)
    visitimm = []
    # Looping over all active visit records of the patient in date order
    for v in patient.visit_set.filter(active=True).order_by('date_of_visit'):
        # Looping over each visit's immunizations (the ManyToManyField on Visit)
        for i in v.immunization.all():
            duedate = v.date_of_visit + timedelta(days=int(i.interval))
            visitimm.append((i, duedate))
    ...
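The date arithmetic in that loop can be checked on its own with plain Python data. In this sketch each visit is a (visit_date, [(immunization_name, interval_days), ...]) tuple standing in for the Visit and Immunizations rows; due_dates and the sample names are invented for illustration.

```python
from datetime import date, timedelta


def due_dates(visits):
    """Return (visit_date, immunization, due_date) rows, oldest visit first."""
    schedule = []
    for visit_date, immunizations in sorted(visits, key=lambda v: v[0]):
        for name, interval_days in immunizations:
            # int() mirrors the view: interval is still stored as text
            due = visit_date + timedelta(days=int(interval_days))
            schedule.append((visit_date, name, due))
    return schedule


history = [
    (date(2014, 3, 1), [("MMR", "30")]),
    (date(2014, 1, 1), [("DTaP", "60"), ("Hib", 30)]),
]
schedule = due_dates(history)
```

Each row keeps the visit date next to the due date, so past-due entries stay visible in the record.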
