I'm building a Django-powered site for a newspaper-ish publication. The least obvious task I've come across in putting the site together is how best to generate a "top articles" list for the sidebar of the page.
The first thing that came to mind was some sort of database column that gets updated (based on what?) with every view. That seems, to my instincts, ridiculously database-intensive and impractical, so I'd like to find another solution.
Thanks all.
I would give celery a try (with django-celery). While it's not as easy to configure and use as the cache framework, it lets you queue tasks like incrementing counters and run them in the background. It can even be combined with the caching technique: in views, increment counters in the cache, and define a PeriodicTask that runs every now and then, resetting the counters and writing them to the database.
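A minimal sketch of that combination, assuming an Article model with a view_count column (both names are mine, not from the question), and using a plain celery task in place of the old django-celery PeriodicTask machinery:

    from celery import shared_task
    from django.core.cache import cache
    from django.db.models import F

    from myapp.models import Article  # assumed model with a view_count field

    def record_view(article_id):
        # Called from the article view: a cheap cache increment, no DB write.
        key = 'article_views:%d' % article_id
        try:
            cache.incr(key)
        except ValueError:
            # Key didn't exist yet.
            cache.set(key, 1, timeout=None)

    @shared_task
    def flush_view_counts(article_ids):
        # Scheduled periodically: fold the buffered counts into the database.
        # (Tracking which ids are dirty is elided; a real version might keep
        # a set of touched ids and decrement by the flushed amount rather
        # than reset, to avoid losing increments that race with the flush.)
        for article_id in article_ids:
            key = 'article_views:%d' % article_id
            count = cache.get(key) or 0
            if count:
                Article.objects.filter(pk=article_id).update(
                    view_count=F('view_count') + count)
                cache.set(key, 0, timeout=None)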
I just remembered: I once found this blog entry, which shows a nice way of incrementing a 'viewed_count' (or similar) column in the database with an AJAX JS call. If you don't have heavy traffic, maybe it's a good idea?
Also mentioned in that post is django-tracking, but I don't know much about it; I've never used it myself (yet).
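For what it's worth, the view side of that AJAX idea can be very small. A sketch, with the model and URL wiring assumed:

    from django.db.models import F
    from django.http import HttpResponse
    from django.views.decorators.http import require_POST

    from myapp.models import Article  # assumed model

    @require_POST
    def increment_view_count(request, article_id):
        # F() makes the increment atomic in the database, so concurrent
        # requests don't lose counts.
        Article.objects.filter(pk=article_id).update(
            view_count=F('view_count') + 1)
        return HttpResponse(status=204)

The page fires a small JS POST to this endpoint after load; as a side effect, simple crawlers that don't execute JS won't inflate the counts.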
Premature optimization: first try the db way and then see if it really is too database-intensive. Any decent database caches well enough that it probably won't matter very much. And even if it does turn out to be a problem, take a look at the other db/cache suggestions here.
It's most likely, by the way, that each view will involve far more intensive db queries than a simple view-count update.
If you do something like sort by top views, it would be fast if you index the view column in the DB. Another option is to only collect the top x articles every hour or so, and toss that value into Django's cache framework.
The nice thing about caching the list is that the algorithm you use to determine top articles can be as complex as you like without hitting the DB hard with every page view. Django's cache framework can use memory, db, or file system. I prefer DB, but many others prefer memory. I believe it uses pickle, so you can also store Python objects directly. It's easy to use, recommended.
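A sketch of that cached-list approach, assuming an Article model ranked by a view_count column (swap in whatever ranking you like):

    from django.core.cache import cache

    from myapp.models import Article  # assumed model

    TOP_ARTICLES_KEY = 'top_articles'

    def get_top_articles(limit=10):
        articles = cache.get(TOP_ARTICLES_KEY)
        if articles is None:
            # The ranking query only runs on a cache miss, so it can be
            # as expensive as you like; here it's recomputed hourly.
            articles = list(Article.objects.order_by('-view_count')[:limit])
            cache.set(TOP_ARTICLES_KEY, articles, timeout=60 * 60)
        return articles

Because the cache pickles values, the list of model instances can be stored and returned directly to the template.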
An index wouldn't help, as the main problem, I believe, is not so much getting the sorted list as having a DB write on every page view of an article. Another index actually makes that problem worse, albeit only a little.
So I'd go with the cache. I think Django's cache shim is a problem here because it requires timeouts on all keys; I'm not sure whether that's imposed by memcached, but if not, go with redis. Actually, just go with redis anyway: the Python library is great, I've used it from Django projects before, and it has atomic increments and powerful sorting, which is everything you need.
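With redis-py (3.x argument order) the counting and the sorting collapse into one sorted set; the key name here is mine:

    import redis

    r = redis.Redis()

    def record_view(article_id):
        # Atomically bump this article's score in the sorted set.
        r.zincrby('article_views', 1, article_id)

    def top_articles(n=10):
        # Highest-scoring article ids, best first; no timeout needed.
        return r.zrevrange('article_views', 0, n - 1)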
Related
My view displays a table of data (specific to a customer, who may have many users), and populating this table takes a lot of computational resources. A customer's data changes 4-5 times a week, usually on the same day.
Caching is an obvious solution to this, but I was wondering if Django's cache framework is significantly more efficient than adding a TextField at the customer level and storing the data there instead?
I feel the TextField is easier to implement (and to clear when the data changes), but what are the drawbacks, and is there anything else I need to look out for? (Problems if the dataset gets too big? Additional fields in the model? Etc.)
Any help would be much appreciated!
A cache is a cache is a cache, however you implement it, and the main problem with caches is invalidation.
As Melvyn rightly answered, the case for the cache framework is that it lives (well, can live, depending on which backend you choose) outside your database. Whether that's a pro or a con really depends on your database load, infrastructure and whatnot... If you already use the cache framework (for more than plain unconditional full-page caching, I mean) and want to minimize the load on your database, then it's possibly worth the added complexity.
Otherwise, storing your computed result in the db is quite straightforward and doesn't require additional servers, installation etc. I'd personally go for a dedicated model (to avoid unnecessary overhead at the db level) holding both the cached result and a checksum of the params on which the result depends (the canonical memoization pattern), so you can easily detect whether it needs to be recomputed. I've found this solution easier to maintain than trying to detect changes to each and every one of those params and invalidate/recompute the cache "on the fly" (which is what can make proper cache invalidation difficult, or at least complex to implement), but again this depends on what those params are and where they come from.
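One possible shape for that dedicated model, with the checksum computed over a canonical serialization of the params (all names here are illustrative, not from the question):

    import hashlib
    import json

    from django.db import models

    class CachedReport(models.Model):
        customer = models.OneToOneField('myapp.Customer',  # assumed model
                                        on_delete=models.CASCADE)
        params_checksum = models.CharField(max_length=64, blank=True)
        result = models.TextField(blank=True)

    def compute_checksum(params):
        # Canonical serialization of everything the result depends on.
        payload = json.dumps(params, sort_keys=True)
        return hashlib.sha256(payload.encode('utf-8')).hexdigest()

    def get_report(customer, params, compute):
        digest = compute_checksum(params)
        row, _ = CachedReport.objects.get_or_create(customer=customer)
        if row.params_checksum != digest:
            # Params changed since the last run: recompute and store.
            row.result = compute(customer, params)
            row.params_checksum = digest
            row.save()
        return row.result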
The upside to using the cache framework is that you don't have to use the database. You can scale your cache store independent of your database and run the cache on different physical (or virtual) machines.
In addition you don't have to implement the stale vs fresh logic, but that's a one-off.
4-5 times a week doesn't look like a big challenge, but nobody except you knows what kind of computation you have, how much data you need to store, how many users you have, and so on.
If you implement this with a TextField, it's still a caching system of sorts, so I suggest using Django's caching system with the database backend first: https://docs.djangoproject.com/en/1.11/topics/cache/#database-caching You can't retrieve the data with one query the way you could with a TextField, but you can later swap the database backend for another layer if necessary.
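Setting that up is just a settings entry plus one management command (the table name is arbitrary):

    # settings.py
    CACHES = {
        'default': {
            'BACKEND': 'django.core.cache.backends.db.DatabaseCache',
            'LOCATION': 'my_cache_table',
        }
    }

    # then create the table once with:
    #   python manage.py createcachetable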
I'm trying to figure out the best approach to keeping track of the number of entities of a certain NDB kind I have in my cloud datastore.
One approach is just, whenever I want to know how many I have, to take the .count() of a query that I know will return all of them, but that costs a ton of datastore small operations (it looks like it's proportional to the number of entities of that kind). So that's not ideal.
Another option would be having a counter in the datastore that gets updated every time I create or delete an entity, but that's also not ideal because it would add an extra read and write operation to every entity I create or destroy.
As of now, it looks like the second option is my best choice, so my question is--do you agree? Are there any other options that would be more cost-effective?
Thanks a lot.
PS: Working in Python if that makes a difference.
Second option is the way to go.
Other considerations:
If you have many writes per second you may wish to consider using a sharded counter
To reduce datastore writes, you could use a cron job to update the datastore at timed intervals (i.e. count how many entities have been created since the last run)
Also consider using memcache.incr() in conjunction with a cron job to persist the data; a sketch follows this list. The downside is that your memcache key could drop, so this is only really an option if the count doesn't have to be accurate.
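Roughly what that memcache-plus-cron idea looks like on GAE (the Counter kind and key name are mine); note that if the memcache key is evicted, the buffered increments are lost:

    from google.appengine.api import memcache
    from google.appengine.ext import ndb

    COUNTER_KEY = 'entity_count_buffer'

    class Counter(ndb.Model):
        count = ndb.IntegerProperty(default=0)

    def record_create():
        # Called whenever an entity is created: no datastore write here.
        memcache.incr(COUNTER_KEY, initial_value=0)

    def persist_counter():
        # Cron handler: fold the buffered increments into the datastore.
        buffered = memcache.get(COUNTER_KEY) or 0
        if buffered:
            counter = Counter.get_or_insert('total')
            counter.count += buffered
            counter.put()
            memcache.decr(COUNTER_KEY, delta=buffered)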
There's actually a better/cheaper/faster way to see the info you're looking for, but it might not work if you need to know the EXACT number of entities at any given moment, since it's only updated a couple of times a day (i.e. you can access it anytime, but it may be a few hours out of date).
The "Datastore Statistics" page in the GAE dashboard displays some detailed data about kinds/entities, including "count" numbers, and there's a way to access it programmatically. See more info here: https://cloud.google.com/appengine/docs/python/datastore/stats
I have a query written in raw SQL in Django.
Suppose the result of that query is assigned to a variable queryResult.
I then loop over this queryResult and retrieve data from three tables using the Django ORM.
For example..
    for item in queryResult:
        a = table1.objects.get(id=item[0])
        b = table2.objects.get(id=item[1])
        c = table2.objects.get(id=item[2])
        z = a.result
        x = a.result1
        v = c.result
        # based on some condition, the data is stored into the list as a dictionary
        recentDocsList.append({'PurchaseType': item[0],
                               'CaseName': z,
                               'DocketNumber': x,
                               'CourtID': item[2],
                               'PacerCmecf': v,
                               'DID': item[3]})
After the loop completes, this recentDocsList is returned.
But the whole thing is making my page render slowly. Does anybody have a method to resolve this issue?
PS: The entire thing is inside a while loop, retrieving 50 results at a time. Control leaves the while loop if fewer than 50 results are retrieved or the recentDocsList length reaches 10.
Thanks in advance.
Don't optimize too early - this can create obfuscation and confusion.
Even using SQLite3 you should be able to pull 50 chained querysets without taxing the DB (moving up to a higher-performance DB like PostgreSQL would improve this further). This suggests that your problem is elsewhere. To debug it, try calling your models / queries / views in
$ ./manage.py debugsqlshell
and this will print out your SQL queries so you can see what is actually being executed. Even better would be to install django-debug-toolbar, as it will show you where the SQL / rendering slowdowns are.
But! Unless you have a really good reason to do so, DON'T WRITE CUSTOM SQL to be executed in Django: the ORM can take care of almost everything. One of the dangers of custom SQL is terrible performance, as you're probably experiencing.
Further, a while loop in a performance-sensitive place (like page rendering) sounds like a disaster waiting to happen. Are you sure you can't rewrite this in a safer way?
Without seeing more code it's difficult to help: how large are your querysets? Are they efficient? Do you have indexes on your tables? (Django will provide these if you let it, but it sounds like you're doing something different.)
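That said, the per-row .get() calls in the loop are an easy win: each one is a separate query, so 50 rows costs around 150 queries. Batching them with in_bulk() (keeping the question's own names) cuts that to three:

    # Gather the ids first, then fetch each table once.
    ids1 = [item[0] for item in queryResult]
    ids2 = [item[1] for item in queryResult]
    ids3 = [item[2] for item in queryResult]

    t1 = table1.objects.in_bulk(ids1)  # {id: instance}, one query
    t2 = table2.objects.in_bulk(ids2)  # fetched but unused in the original snippet
    t3 = table2.objects.in_bulk(ids3)

    for item in queryResult:
        a = t1[item[0]]
        c = t3[item[2]]
        recentDocsList.append({'PurchaseType': item[0],
                               'CaseName': a.result,
                               'DocketNumber': a.result1,
                               'CourtID': item[2],
                               'PacerCmecf': c.result,
                               'DID': item[3]})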
I'm making an app that needs reverse searches. By this I mean that users of the app will enter search parameters and save them; then, when any new object is entered into the system, if it matches the existing search parameters that a user has saved, a notification will be sent, etc.
I am having a hard time finding solutions for this type of problem.
I am using Django and thinking of building the searches and pickling them using Q objects as outlined here: http://www.djangozen.com/blog/the-power-of-q
The way I see it, when a new object is entered into the database, I'll have to load every single saved query from the db and somehow run it against this one new object to see if it matches that search query... This doesn't seem ideal. Has anyone tackled such a problem before?
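As a rough sketch of what I mean (SavedSearch and its pickled_q field are placeholder names), running one saved query against just the new object might look like:

    import pickle

    def matches(saved_search, new_obj):
        q = pickle.loads(saved_search.pickled_q)  # the stored Q object
        # Constrain the saved query to this one row, so the check touches
        # a single record instead of scanning the table.
        return type(new_obj).objects.filter(q, pk=new_obj.pk).exists()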
At the database level, many databases offer 'triggers'.
Another approach is to have timed jobs that periodically fetch all items from the database that have a last-modified date since the last run; these then get filtered and alerts issued. You can perhaps push some of the filtering into the query statement in the database. However, this is a bit trickier if notifications need to be sent when items get deleted.
You can also put triggers manually into the code that submits data to the database, which is perhaps more flexible and certainly doesn't rely on specific features of the database.
A nice way for the triggers and the alerts to communicate is through message queues - queues such as RabbitMQ and other AMQP implementations will scale with your site.
The amount of effort you use to solve this problem is directly related to the number of stored queries you are dealing with.
Over 20 years ago we handled stored queries by treating them as minidocs and indexing them based on all of their must-have and may-have terms. A new doc's term list was used as a sort of query against this "database of queries", and that built a list of possibly interesting searches to run; then only those searches were run against the new docs. This may sound convoluted, but when there are more than a few stored queries (say anywhere from 10,000 to 1,000,000 or more) and you have a complex query language that supports a hybrid of Boolean and similarity-based searching, it substantially reduced the number we had to execute as full-on queries, often to no more than 10 or 15.
One thing that helped was that we were in control of the horizontal and the vertical of the whole thing. We used our query parser to build a parse tree and that was used to build the list of must/may have terms we indexed the query under. We warned the customer away from using certain types of wildcards in the stored queries because it could cause an explosion in the number of queries selected.
Update for comment:
Short answer: I don't know for sure.
Longer answer: We were dealing with a custom-built text search engine, and part of its query syntax allowed slicing the doc collection in certain ways very efficiently, with special emphasis on date_added. We played a lot of games because we were ingesting 4-10 million new docs a day and running them against up to 1,000,000+ stored queries on DEC Alphas with 64MB of main memory. (This was in the late '80s/early '90s.)
I'm guessing that filtering on something equivalent to date_added could be used in combination with the date of the last time you ran your queries, or maybe the highest id at the last query run time. If you need to re-run the queries against a modified record, you could use its id as part of the query.
For me to get any more specific, you're going to have to get a lot more specific about exactly what problem you're trying to solve and the scale of the solution you're trying to accomplish.
If you stored the type(s) of object(s) involved in each stored search as a generic relation, you could add a post-save signal to all involved objects. When the signal fires, it looks up only the searches that involve its object type and runs those (roughly as sketched below). That will probably still run into scaling issues if you have a ton of writes to the db and a lot of saved searches, but it's a straightforward Django approach.
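A hedged sketch of that signal wiring; SavedSearch with a content_type field, the Document model, and the matches()/notify() helpers are all assumptions:

    from django.contrib.contenttypes.models import ContentType
    from django.db.models.signals import post_save
    from django.dispatch import receiver

    from myapp.models import Document, SavedSearch  # assumed models
    from myapp.notifications import notify  # assumed helper

    @receiver(post_save, sender=Document)
    def run_saved_searches(sender, instance, created, **kwargs):
        if not created:
            return
        # Only the searches stored against this model's content type.
        ct = ContentType.objects.get_for_model(sender)
        for search in SavedSearch.objects.filter(content_type=ct):
            if search.matches(instance):  # e.g. a pk-pinned Q check
                notify(search.user, instance)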
I've written an application for Google AppEngine, and I'd like to make use of the memcache API to cut down on per-request CPU time. I've profiled the application and found that a large chunk of the CPU time is in template rendering and API calls to the datastore, and after chatting with a co-worker I jumped (perhaps a bit early?) to the conclusion that caching a chunk of a page's rendered HTML would cut down on the CPU time per request significantly. The caching pattern is pretty clean, but the question of where to put this logic of caching and evicting is a bit of a mystery to me.
For example, imagine an application's main page has an Announcements section. This section would need to be re-rendered after:
first read for anyone in the account,
a new announcement being added, and
an old announcement being deleted
Some options of where to put the evict_announcements_section_from_cache() method call:
in the Announcement Model's .delete(), and .put() methods
in the RequestHandler's .post() method
anywhere else?
Then in the RequestHandler's get method, I could potentially call get_announcements_section(), which would follow the standard memcache pattern (check the cache, add to it on a miss, return the value) and pass that HTML down to the template for that chunk of the page.
Is it the typical design pattern to put the cache-evicting logic in the Model, or the Controller/RequestHandler, or somewhere else? Ideally I'd like to avoid having evicting logic with tentacles all over the code.
I've got just such a decorator up in an open source Github project:
http://github.com/jamslevy/gae_memoize/tree/master
It's a bit more in-depth, allowing for things like forcing execution of the function (when you want to refresh the cache) or forcing caching locally... these were just things I needed in my app, so I baked them into my memoize decorator.
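The core of such a decorator is small. This isn't the linked project's code, just a minimal sketch of the same idea on GAE's memcache API:

    import functools

    from google.appengine.api import memcache

    def memoize(key_format, time=60):
        def decorator(fn):
            @functools.wraps(fn)
            def wrapper(*args):
                key = key_format % args
                value = memcache.get(key)
                if value is None:
                    value = fn(*args)
                    memcache.set(key, value, time=time)
                return value
            return wrapper
        return decorator

    # Usage: cache the rendered announcements HTML for five minutes.
    @memoize('announcements:%s', time=300)
    def get_announcements_section(account_id):
        return render_announcements(account_id)  # assumed render helper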
A couple of alternatives to regular eviction:
The obvious one: don't evict, and set a timeout instead. Even a really short one (a few seconds) can cut the load by a huge amount for a popular app, without users even noticing that the data may be a few seconds stale.
Instead of evicting, generate the cache key based on criteria that change when the data does. For example, if retrieving the key of the most recent announcement is cheap, you could use that as part of the cache key. When a new announcement is posted, you'll go looking for a key that doesn't exist and regenerate the section as a result.
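Sketching that second idea for the announcements example (the Announcement model, its created property, and render_announcements() are assumed):

    from google.appengine.api import memcache

    from models import Announcement  # assumed model

    def get_announcements_html(account):
        # Bake the newest announcement's key into the cache key: posting
        # a new announcement changes the key, so the old entry is simply
        # never read again and no explicit eviction is needed.
        latest = Announcement.all().order('-created').get()
        cache_key = 'announcements:%s:%s' % (
            account.key(), latest.key() if latest else 'none')
        html = memcache.get(cache_key)
        if html is None:
            html = render_announcements(account)  # assumed render helper
            memcache.set(cache_key, html)
        return html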