I'm thinking of adding a reputation system to my Django web application; the site is already being used so I'm trying to be careful about my choices.
Reputation is generated by all actions that contribute to the site, similar to Stack Overflow's system.
I know there are literally millions of ways of implementing this, which is why I feel quite lost.
Two alternatives I am not sure about are:
Keep track of reasons why reputation was incremented
Ignore the reasons, in order to reduce the site's complexity and overhead
I'd be happy with a few pointers and directions. Any help would be very much appreciated!
In Django, I'd suggest having a property on the User (or Profile) model that calculates a user's reputation on-demand. Then, cache the reputation with your caching framework and/or store to the database for fast retrieval.
This way, in addition to having the records of what impacts reputation, you can change your reputation criteria at will.
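A minimal sketch of that idea, assuming a hypothetical ReputationEvent model as the underlying record (the model, key names, and the 15-minute timeout are all illustrative):

```python
from django.contrib.auth.models import User
from django.core.cache import cache
from django.db import models


class ReputationEvent(models.Model):
    # Hypothetical record of a single reputation-affecting action.
    user = models.ForeignKey(User, on_delete=models.CASCADE)
    points = models.IntegerField()
    created = models.DateTimeField(auto_now_add=True)


class Profile(models.Model):
    user = models.OneToOneField(User, on_delete=models.CASCADE)

    @property
    def reputation(self):
        # Compute on demand, then cache for fast retrieval.
        key = "reputation:%d" % self.user_id
        total = cache.get(key)
        if total is None:
            total = (ReputationEvent.objects
                     .filter(user=self.user)
                     .aggregate(models.Sum("points"))["points__sum"] or 0)
            cache.set(key, total, 60 * 15)  # 15-minute timeout, tune to taste
        return total
```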
Keep track of the reasons, IMHO. It surely wouldn't be that complex, and you don't need to store a huge amount of information: just a datetime, point value, command, target, and originator. If the data grows too large after a while, dump the DB to a backup medium and clear the history.
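That record could look something like this in Django (field names purely illustrative):

```python
from django.conf import settings
from django.db import models


class ReputationLog(models.Model):
    # One row per reputation-affecting action.
    timestamp = models.DateTimeField(auto_now_add=True)
    points = models.IntegerField()              # point value (may be negative)
    command = models.CharField(max_length=50)   # e.g. "upvote", "accept"
    target_id = models.PositiveIntegerField()   # the object acted upon
    originator = models.ForeignKey(settings.AUTH_USER_MODEL,
                                   on_delete=models.CASCADE)
```

A generic foreign key would make the target more flexible, but a plain id keeps the sketch simple.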
My view displays a table of data (specific to a customer, who may have many users), and this table takes a lot of computational resources to populate. A customer's data changes 4-5 times a week, usually on the same day.
Caching is an obvious solution to this, but I was wondering whether Django's cache framework is significantly more efficient than creating a TextField at the customer level and storing the data there instead.
I feel the TextField is easier to implement (just clear it when the data changes), but what are the drawbacks, and is there anything else I need to look out for? (Problems if the dataset gets too big? Additional fields in the model? Etc.)
Any help would be much appreciated!
A cache is a cache is a cache, however you implement it, and the main problem with caches is invalidation.
As Melvyn rightly answered, the case for the cache framework is that it lives (well, can live, depending on which backend you choose) outside your database. Whether that's a pro or a con really depends on your database load, infrastructure and whatnot... if you already use the cache framework (for more than plain unconditional full-page caching, I mean) and want to minimize the load on your database, then it's possibly worth the added complexity.
Otherwise, storing your computed result in the db is quite straightforward and doesn't require additional servers, installs etc. I'd personally go for a dedicated model - to avoid unnecessary overhead at the db level - holding both the cached result and a checksum of the params the result depends on (the canonical memoization pattern), so you can easily detect whether it needs to be recomputed. I found this solution easier to maintain than trying to detect changes to each and every one of those params and invalidate/recompute the cache "on the fly" (which is what can make proper cache invalidation difficult, or at least complex to implement), but this again depends on what those params are and where they come from.
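A rough sketch of that pattern, with made-up names (compute_expensive_report stands in for whatever builds the data):

```python
import hashlib
import json

from django.db import models


class CachedReport(models.Model):
    # One row per customer: the expensive result plus a checksum
    # of the inputs it was computed from.
    customer_id = models.PositiveIntegerField(unique=True)
    checksum = models.CharField(max_length=64, blank=True)
    result = models.TextField(blank=True)


def get_report(customer, params):
    # Canonical serialization of the params the result depends on.
    digest = hashlib.sha256(
        json.dumps(params, sort_keys=True).encode("utf-8")).hexdigest()
    row, _ = CachedReport.objects.get_or_create(customer_id=customer.pk)
    if row.checksum != digest:
        # First run, or the inputs changed: recompute and store.
        row.result = compute_expensive_report(customer, params)
        row.checksum = digest
        row.save()
    return row.result
```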
The upside to using the cache framework is that you don't have to use the database. You can scale your cache store independent of your database and run the cache on different physical (or virtual) machines.
In addition you don't have to implement the stale vs fresh logic, but that's a one-off.
4-5 times a week doesn't look like a big challenge, but nobody except you knows what kind of computation you have, how much data you need to store, how many users you have, and so on.
If you implement this with a TextField, it's still a caching system of sorts, so I suggest using Django's caching system with the database backend first: https://docs.djangoproject.com/en/1.11/topics/cache/#database-caching You can't retrieve the data with one query like you could with a TextField, but later you can replace the database with another cache layer if necessary.
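For reference, the database backend is just a settings entry plus a one-off `manage.py createcachetable`, and usage in the view is a get/set pair (the table and key names here are made up):

```python
# settings.py
CACHES = {
    "default": {
        "BACKEND": "django.core.cache.backends.db.DatabaseCache",
        "LOCATION": "my_cache_table",  # created by `manage.py createcachetable`
    }
}
```

```python
# views.py (or wherever the table gets built)
from django.core.cache import cache

def customer_table(customer):
    key = "customer_table:%d" % customer.pk
    data = cache.get(key)
    if data is None:
        data = build_customer_table(customer)  # your expensive computation
        cache.set(key, data, 60 * 60 * 24)     # or delete the key on change
    return data
```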
So I am debating whether or not to use Django's select_related, for performance reasons.
In the documentation, it says that this is a "performance booster" because it avoids further database queries, but that clearly means it has to pull more data into memory up front, which could add up if you need to make a lot of separate calls for a lot of different users.
What are the pros and cons of performance with Django's select_related? And when should (or shouldn't) it be used?
Whether you use select_related or not, you will eat memory each time you access a related object, so if you have to access related objects anyway it won't make that much of a difference wrt memory usage, and it can indeed save a lot of db access cost - especially if your db server is not on the same node as your Django instance(s). To make a long story short:
as a general guideline: use select_related (with appropriate params to limit what relationships should be followed) when you know you'll need the related object.
if in doubt, don't try to guess, test and profile (yes it requires quite some infrastructure to do proper testing and profiling here but hey, that's how it is).
My own experience: careful use of select_related can vastly improve execution time. I've never had a problem with memory, but we usually do our best to avoid loading millions of rows when we just need a couple (doing proper filtering, slicing etc. before the query is actually eval'd).
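To make the trade-off concrete, a minimal example (Book and Author are placeholder models, with Book holding a ForeignKey to Author):

```python
from myapp.models import Book  # hypothetical app and models

# Without select_related: one query for the books, then one extra
# query for each book.author access (the classic N+1 pattern).
for book in Book.objects.all():
    print(book.author.name)

# With select_related: a single JOINed query loads both tables,
# so book.author is already in memory.
for book in Book.objects.select_related("author"):
    print(book.author.name)
```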
We are considering implementing a voting system (up, down votes) without using any type of credentials--no app accounts nor OpenID or anything of that sort.
Concerns in order:
Prevent robot votes
Allow individuals under a NAT to vote without overriding/invalidating someone else's vote
Preventing (or, at the very least making very difficult for) users to vote more than once
My questions:
If you've implemented something similar, any tips?
Any concerns that perhaps I'm overlooking?
Any tools that I should perhaps look into?
If you have any questions that would help you in forming an answer to any of these, please ask in the comments!
To address your concerns:
1: A simple CAPTCHA would probably do the trick; if you Google "django captcha", there are a bunch of plugins. I've never used them myself, so I can't say which is the best.
2 & 3: Using Django's sessions addresses both of these problems - with it you could save a cookie on the user's browser to indicate that the person has already voted. This obviously allows people to vote via different browsers or by clearing their cache, so it depends on how important it is that people not be allowed to vote twice. I would imagine that only a small percentage of people would actually think to try clearing their cache, though. As far as I know the only other way to limit users without a sign-in process would be to test IP addresses, but that would violate your second criteria since people on the same network will show up as having the same IP address.
If you don't want multiple votes to be as simple as deleting browser cookies, you could also allow facebook or twitter login - the django-socialregistration plugin is pretty well documented and straightforward to implement.
Hope that helps!
reCAPTCHA is an excellent choice. For Django, here's the client I've had the most success with, which actually uses images loaded from reCAPTCHA (as opposed to local images generated on the fly):
http://pypi.python.org/pypi/recaptcha-client#downloads
Instructions for installation are in this snippet:
http://djangosnippets.org/snippets/433/
If reCAPTCHA is a bit unwieldy for what you're doing, I've heard of people implementing a form that loads with a hidden input containing a timestamp value corresponding to when the form was rendered. Then, when the form is submitted, generate a new timestamp and take the difference between the two. If the difference in seconds is below a threshold that's unreasonable for a human visitor, chances are you have a bot. This works well for contact forms with several fields... it usually takes a person more than 10 seconds to fill them out.
I can't speak to how effective this technique actually is in production... a lot of these spam bots are smarter than I am these days. But it might be something you'd consider looking into or testing.
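If you do want to test it, the server side is only a few lines; here's one way it might look (the form, its fields, and the 5-second threshold are all guesses):

```python
import time

from django import forms


class ContactForm(forms.Form):
    rendered_at = forms.FloatField(widget=forms.HiddenInput)
    message = forms.CharField(widget=forms.Textarea)

    def clean(self):
        cleaned = super().clean()
        rendered = cleaned.get("rendered_at")
        # Missing or too-recent timestamp: probably not a human.
        if rendered is None or time.time() - rendered < 5:
            raise forms.ValidationError("Form submitted too quickly.")
        return cleaned

# When rendering the page:
#   form = ContactForm(initial={"rendered_at": time.time()})
```

A bot that parses the form can of course echo the field back, which is in line with the caveat above.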
I'm creating a Django-powered site for my newspaper-ish publication. The least obvious and common-sense task that I have come across in getting the site together is how best to generate a "top articles" list for the sidebar of the page.
The first thing that came to mind was some sort of database column that is updated (based on what?) with every view. That seems (to my instincts) ridiculously database-intensive and impractical, so I think I'd like to find another solution.
Thanks all.
I would give Celery a try (with django-celery). While it's not as easy to configure and use as the cache, it enables you to queue tasks like incrementing counters and run them in the background. It can even be combined with the cache technique - in views, increment counters in the cache, and define a PeriodicTask that runs every now and then, resetting the counters and writing them to the database.
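A sketch of that combination (record_view, the key scheme, and the Article.views field are made up, and the beat/PeriodicTask scheduling wiring is left out):

```python
from celery import shared_task
from django.core.cache import cache
from django.db.models import F

from myapp.models import Article  # hypothetical


def record_view(article_id):
    # Called from the view: a cheap cache increment, no db write.
    key = "views:%d" % article_id
    try:
        cache.incr(key)
    except ValueError:  # key doesn't exist yet
        cache.set(key, 1, None)


@shared_task
def flush_view_counts(article_ids):
    # Run periodically: move the counters into the database.
    for article_id in article_ids:
        key = "views:%d" % article_id
        count = cache.get(key) or 0
        if count:
            Article.objects.filter(pk=article_id).update(
                views=F("views") + count)
            cache.set(key, 0, None)  # note: get-then-reset isn't atomic
```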
I just remembered - I once found this blog entry, which describes a nice way of incrementing a 'viewed_count' (or similar) column in the database with an AJAX JS call. If you don't have heavy traffic, maybe it's a good idea?
Also mentioned in this post is django-tracking, but I don't know much about it, I never used it myself (yet).
Premature optimization: first try the db way and then see if it really is too database-intensive. Any decent database has caches good enough that it probably won't matter very much. And even if it is a problem, take a look at the other db/cache suggestions here.
Most likely, by the way, you will have many more intensive db queries with each view than a simple view-count update.
If you do something like sort by top views, it will be fast as long as you index the views column in the DB. Another option is to collect the top X articles only every hour or so, and toss that list into Django's cache framework.
The nice thing about caching the list is that the algorithm you use to determine the top articles can be as complex as you like without hitting the DB hard on every page view. Django's cache framework can use memory, the db, or the file system. I prefer the DB, but many others prefer memory. I believe it uses pickle, so you can also store Python objects directly. It's easy to use; recommended.
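For example, an hourly job could be as simple as this (Article and the views-based ranking are placeholders):

```python
from django.core.cache import cache

from myapp.models import Article  # hypothetical


def refresh_top_articles():
    # The ranking can be as expensive as you like; it only runs hourly.
    top = list(Article.objects.order_by("-views")[:10])
    cache.set("top_articles", top, 60 * 60)


def get_top_articles():
    # What the sidebar actually reads on every page view.
    return cache.get("top_articles") or []
```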
An index wouldn't help, as the main problem, I believe, is not so much getting the sorted list as having a DB write with every page view of an article. Another index actually makes that problem worse, albeit only a little.
So I'd go with the cache. I think Django's cache shim is a problem here because it requires timeouts on all keys. I'm not sure if that's imposed by memcached; if not, go with redis. Actually, just go with redis anyway: the Python library is great, I've used it from Django projects before, and it has atomic increments and powerful sorting - everything you need.
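With the redis-py library that's essentially two commands, sketched below (the key name is made up, and the zincrby argument order shown is the redis-py 3.x one):

```python
import redis

r = redis.StrictRedis()


def record_view(article_id):
    # Atomic increment inside a sorted set: no read-modify-write race.
    r.zincrby("article_views", 1, article_id)


def top_articles(n=10):
    # Highest-scoring members first, returned as (article_id, views) pairs.
    return r.zrevrange("article_views", 0, n - 1, withscores=True)
```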
I was working on a project a few months ago and needed to implement an award system, similar to Stack Overflow's badge system.
I may not have implemented it in the best possible way, and I am curious what your take on it would be.
What would be a good way to track the user activity needed for badge awarding?
Stack Overflow's system needs to know about a lot of information, and I also get the impression that there would be a lot of data processing complicating things.
I would assume that SO calculates badges once or twice every 24 hours, and that maybe logs are stored, or a server is dedicated to badge calculation.
Thoughts?
I don't think it's as complicated as you think. I highly doubt that SO calculates badges from some kind of separate user activity log (although technically the entire database is a user activity log). When I look at the list of badges, I don't see anything that couldn't be implemented by running a SQL SELECT query.
Some of the queries could be pretty complicated, and there might be some sort of fancy caching mechanism, but I don't see any reason why you would have to calculate badges in batches.
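For instance, a hypothetical "100 answers" badge expressed with the Django ORM, since this is a Django thread (the answer and badges relations are made up):

```python
from django.contrib.auth.models import User
from django.db.models import Count

# Everyone who qualifies for a hypothetical "100 answers" badge
# but hasn't been given it yet -- one query, no activity log needed.
qualifiers = (User.objects
              .annotate(answer_count=Count("answer"))
              .filter(answer_count__gte=100)
              .exclude(badges__name="100 answers"))
```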
In general badge/point systems can be based on two things.
An activity log of interesting events. This is effectively the paper register receipt of what has happened, so you can re-compute everything from the ground up if that's ever needed. It can be as simple as (user_id, timestamp, event_id, event_detail).
Most of the time you've pre-designed your scoring/point system, so you know exactly which counters to keep on a user. Then it's as simple as having one big record that contains all of the details: (user_id, reply_points, login_points, last_login, thumbs_up_points, etc., etc.)
Now you can slap some simple methods on that model object and have it manage/store the points as needed.
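In Django terms, option 2 might look like this (field names taken from the example above, method name made up):

```python
from django.conf import settings
from django.db import models


class UserPoints(models.Model):
    # One wide record per user holding every pre-designed counter.
    user = models.OneToOneField(settings.AUTH_USER_MODEL,
                                on_delete=models.CASCADE)
    reply_points = models.IntegerField(default=0)
    login_points = models.IntegerField(default=0)
    thumbs_up_points = models.IntegerField(default=0)
    last_login = models.DateTimeField(null=True, blank=True)

    def award(self, counter, amount=1):
        # The "simple method" that manages/stores the points.
        setattr(self, counter, getattr(self, counter) + amount)
        self.save(update_fields=[counter])

# Usage: points.award("reply_points", 5)
```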