Caching data from other websites in Django

Suppose I have a simple view which needs to parse data from an external website.
Right now it looks something like this:
    import urllib2
    import BeautifulSoup
    from django.shortcuts import render_to_response

    def index(request):
        source = urllib2.urlopen(EXTERNAL_WEBSITE_URL)  # fetched on every request
        bs = BeautifulSoup.BeautifulSoup(source.read())
        finalList = []  # do whatever with bs to populate the list
        return render_to_response('someTemplate.html', {'finalList': finalList})
First of all, is this an acceptable use?
Obviously, this is not good performance-wise. The external website page is pretty big, and I am only extracting a small part of it. I thought of two solutions:
1. Do all of this asynchronously: load the rest of the page, then populate it with the data once I get it. But I don't even know where to start. I'm just starting with Django and have never done anything async up until now.
2. I don't care if this data is only updated every 2-3 minutes, so caching is a good solution as well (it also saves me the extra round-trips). How would I go about caching this data?

First, don't optimize prematurely. Get this to work.
Then, add enough logging to see what the performance problems (if any) really are.
You may find that the end user's PC is the slowest part; getting data from another site may actually be remarkably fast when you do not fetch the .JS libraries, the .CSS and the artwork, and then render the entire thing in a browser.
Once you're absolutely sure that the fetch of the remote content really IS a problem (really), do the following:
1. Write a "crontab" script that does the remote fetch from time to time.
2. Design a place to cache the remote results. Database or file system, pick one.
3. Update your Django app to get the data from the cache (database or filesystem) instead of the remote URL.
Do this only after you have absolute proof that the urllib2 read of the remote site is the bottleneck.
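The three steps above could be sketched roughly as follows, assuming the file system is chosen as the cache. The function names (write_cache, read_cache, refresh), the JSON layout and the fetch_items callable are all made up for illustration; fetch_items stands for whatever does the urllib2 read and the BeautifulSoup extraction. This is written for Python 3 and uses only the standard library.

```python
import json
import os
import tempfile
import time


def write_cache(path, items):
    """Store the extracted items with a timestamp, replacing the file atomically."""
    payload = {"fetched_at": time.time(), "items": items}
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(payload, handle)
    # os.replace swaps the file in one step, so a reader never sees a half-written cache
    os.replace(tmp_path, path)


def read_cache(path, max_age_seconds):
    """Return the cached items, or None if the file is missing, unreadable or too old."""
    try:
        with open(path) as handle:
            payload = json.load(handle)
    except (OSError, ValueError):
        return None
    if time.time() - payload["fetched_at"] > max_age_seconds:
        return None
    return payload["items"]


def refresh(path, fetch_items):
    """What the cron job would run: fetch, extract, and store."""
    write_cache(path, fetch_items())
```

The crontab entry would call refresh() every couple of minutes, and the Django view would call read_cache() instead of hitting the remote URL, falling back to an empty list (or a direct fetch) when it returns None.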

Caching with Django is pretty easy. Inside the view:
    from django.core.cache import cache

    key = 'some-key'
    data = cache.get(key)
    if data is None:
        # cache miss: soupify the page and what not, assigning the result to data
        cache.set(key, data, 60*60*8)  # arguments are key, value, timeout in seconds
    return render_to_response ...
To answer your questions: you can do this asynchronously, but then you would have to use something like django cron to update the cache every so often. Alternatively, you can write this as a standalone Python script, replace the cache imported from Django with memcache, and it would work the same way. It would reduce some of the performance issues your site could have, and as long as you know the cache key, you can retrieve the data from the cache.
Like Jarret said, I would read Django's caching docs and memcache's docs for more information.
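To make the check-then-fill pattern concrete without needing a running Django or memcached, here is a small sketch. SimpleCache is a hypothetical in-memory stand-in that only mimics the get(key) / set(key, value, timeout) call shape; it is not Django's cache implementation, and get_or_compute is a made-up helper name.

```python
import time


class SimpleCache:
    """Hypothetical in-memory stand-in exposing get(key) and set(key, value, timeout)."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._entries = {}

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Expired entries behave like a miss.
            del self._entries[key]
            return None
        return value

    def set(self, key, value, timeout):
        self._entries[key] = (value, self._clock() + timeout)


def get_or_compute(cache, key, compute, timeout):
    """Return the cached value, computing and storing it only on a miss."""
    value = cache.get(key)
    if value is None:
        value = compute()
        cache.set(key, value, timeout)
    return value
```

With a timeout of 180 seconds this matches the "updated every 2-3 minutes" requirement from the question: the expensive fetch runs at most once per window, and every other request gets the stored list.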

Django has robust, built-in support for caching views: http://docs.djangoproject.com/en/dev/topics/cache/#topics-cache.
It offers solutions for caching entire views (such as in your case), or just certain parts of data in the view. There are even controls for how often to update the cache, and so forth.

Related

Django session race condition?

Summary: is there a race condition in Django sessions, and how do I prevent it?
I have an interesting problem with Django sessions which I think involves a race condition due to simultaneous requests by the same user.
It has occurred in a script for uploading several files at the same time, being tested on localhost. I think this makes simultaneous requests from the same user quite likely (low response times due to localhost, long requests due to file uploads). It's still possible for normal requests outside localhost though, just less likely.
I am sending several (file post) requests that I think do this:
1. Django automatically retrieves the user's session*
2. Unrelated code that takes some time
3. Get request.session['files'] (a dictionary)
4. Append data about the current file to the dictionary
5. Store the dictionary in request.session['files'] again
6. Check that it has indeed been stored
7. More unrelated code that takes time
8. Django automatically stores the user's session
Here the check at 6. will indicate that the information has indeed been stored in the session. However, future requests indicate that sometimes it has, sometimes it has not.
What I think is happening is that two of these requests (A and B) happen simultaneously. Request A retrieves request.session['files'] first, then B does the same, changes it and stores it. When A finally finishes, it overwrites the session changes by B.
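The interleaving described above can be reproduced deterministically without threads. In this sketch the store dict, load_session and save_session are hypothetical stand-ins for the session backend; the point is only that each request works on its own copy and writes the whole thing back.

```python
import copy

# Hypothetical stand-in for the session backend: one stored dict per session key.
store = {"session-1": {"files": {}}}


def load_session(key):
    """Each request works on its own copy, as if loaded at the start of the request."""
    return copy.deepcopy(store[key])


def save_session(key, session):
    """Saving writes the whole session back, replacing whatever is stored."""
    store[key] = copy.deepcopy(session)


# Requests A and B both load the session before either one saves.
session_a = load_session("session-1")
session_b = load_session("session-1")

session_b["files"]["b.txt"] = "uploaded by B"
save_session("session-1", session_b)   # B finishes first

session_a["files"]["a.txt"] = "uploaded by A"
save_session("session-1", session_a)   # A finishes last and overwrites B's change

remaining = sorted(store["session-1"]["files"])
```

After this runs, remaining contains only a.txt: B's entry was stored and would have passed a check made right after it was written, yet it is gone once A saves its older copy.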
Two questions:
Is this indeed what is happening? Is the django development server multithreaded? On Google I'm finding pages about making it multithreaded, suggesting that by default it is not? Otherwise, what could be the problem?
If this race condition is the problem, what would be the best way to solve it? It's an inconvenience but not a security concern, so I'd already be happy if the chance can be decreased significantly.
Retrieving the session data right before the changes and saving it right after should decrease the chance significantly I think. However I have not found a way to do this for the request.session, only working around it using django.contrib.sessions.backends.db.SessionStore. However I figure that if I change it that way, Django will just overwrite it with request.session at the end of the request.
So I need a request.session.reload() and request.session.commit(), basically.

Yes, it is possible for a request to start before another has finished. You can check this by printing something at the start and end of a view and launching a bunch of requests at the same time.
Indeed the session is loaded before the view and saved after the view. You can reload the session using request.session = engine.SessionStore(session_key), where engine is the session backend module named by settings.SESSION_ENGINE, and save it using request.session.save().
Reloading the session however does discard any data added to the session before that (in the view or before it). Saving before reloading would destroy the point of loading late. A better way would be to save the files to the database as a new model.
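A rough sketch of why one row per file sidesteps the overwrite problem, using the standard-library sqlite3 module in place of a Django model. The uploaded_file table, its columns, and the record_upload and files_for helpers are assumptions made for illustration.

```python
import sqlite3

# Hypothetical table standing in for a Django model with one row per uploaded file.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE uploaded_file (session_key TEXT NOT NULL, name TEXT NOT NULL)"
)


def record_upload(session_key, name):
    """Each request inserts its own row instead of rewriting a shared dictionary."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO uploaded_file (session_key, name) VALUES (?, ?)",
            (session_key, name),
        )


def files_for(session_key):
    """List the file names recorded for one session."""
    rows = conn.execute(
        "SELECT name FROM uploaded_file WHERE session_key = ? ORDER BY name",
        (session_key,),
    )
    return [name for (name,) in rows]


record_upload("session-1", "b.txt")  # request B
record_upload("session-1", "a.txt")  # request A
```

Because each request only adds a row, neither request has a stale copy of the other's data to write back, so the order in which they finish no longer matters.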
The essence of the answer is in the discussion of Thomas' answer, which was incomplete so I've posted the complete answer.

Mark just nailed it; my only minor addition is how to reload that session:
    for key in list(session.keys()):  # only if keys may have been removed
        del session[key]
    session.update(session.load())
    session.modified = False  # just making it clean
The loop is optional; you only need it if certain values might have been removed from the session in the meantime.
The last line is also optional; if you go on to update the session, it does not really matter.

That is true. You can confirm it by having a look at the django.contrib.sessions.middleware.SessionMiddleware.
Basically, request.session is loaded before request hits your view (in process_request), and it is updated in the session backend (if needed) after the response has left your view (in process_response).
If what I mean is unclear, you might want to have a look at the django documentation for Middleware.
The best way to solve the issue will depend on what you're trying to achieve with that information. I'll update my answer if you provide that information!

Optimize Django

I have a problem with the speed of page loading.
Now it takes about 7 seconds to load the pages and 2~3 seconds is Django processing.
The obvious thing to blame is my lack of architecture knowledge: the pages execute about 50 queries on average, as shown by the Django Debug Toolbar. Most of the queries are things like "yesterday's snapshot (group by something)" or "daily snapshot (group by something) before yesterday", which don't have to be recomputed on every request.
The ideas I have come up with are memory caching, or creating a new table for data that can be prepared in advance.
Is there any convention or design pattern for this kind of issue?
Sample queries are below (I believe they should not have to run every time for yesterday's or last month's data):
SELECT `sample_salestarget`.`id`, `sample_salestarget`.`country_id`, `sample_salestarget`.`year`, `sample_salestarget`.`month`, `sample_salestarget`.`sales` FROM `sample_salestarget` WHERE (`sample_salestarget`.`country_id` = "abc" AND `sample_salestarget`.`month` = 8 AND `sample_salestarget`.`year` = 2012 )
SELECT `sample_dailysummary`.`id`, `sample_dailysummary`.`country_id`, `sample_dailysummary`.`date`, `sample_dailysummary`.`pv_day`, `sample_dailysummary`.`pv_week`, `sample_dailysummary`.`pv_month`, `sample_dailysummary`.`active_uu_day`, `sample_dailysummary`.`active_uu_week`, `sample_dailysummary`.`active_uu_month`, `sample_dailysummary`.`active_uu_7days`, `sample_dailysummary`.`active_uu_30days`, `sample_dailysummary`.`paid_uu_day`, `sample_dailysummary`.`paid_uu_week`, `sample_dailysummary`.`paid_uu_month`, `sample_dailysummary`.`sales_day`, `sample_dailysummary`.`sales_week`, `sample_dailysummary`.`sales_month`, `sample_dailysummary`.`register_uu_day`, `sample_dailysummary`.`register_uu_week`, `sample_dailysummary`.`register_uu_month`, `sample_dailysummary`.`pay_count_day`, `sample_dailysummary`.`pay_count_week`, `sample_dailysummary`.`pay_count_month`, `sample_dailysummary`.`total_user`, `sample_dailysummary`.`inv_access_uu`, `sample_dailysummary`.`inv_sender_uu`, `sample_dailysummary`.`inv_accepted_uu`, `sample_dailysummary`.`inv_send_count`, `sample_dailysummary`.`memo`, `sample_dailysummary`.`first_charge_uu` FROM `sample_dailysummary` WHERE `sample_dailysummary`.`date` = 2012-09-07 AND `sample_dailysummary`.`country_id` = "abc" )

Using Memcached can really speed things up for you. However, that does come with its problems. You have to be extra careful on dynamic pages about explicitly invalidating caches whenever required.
Along with Memcached, try johnny-cache, which does a very good job of caching your Django ORM queries.
Also, make use of Django's session variables as far as possible. (Try the cached_db session engine if you're using Memcached.) You could save objects (like your user profile settings) which stay consistent throughout a session. This way you're reducing the number of SQL calls again.
And if you really, really need quick page loads, maybe try loading your page first, then running your SQL statements asynchronously using Celery and loading the results in an AJAXy manner.

If this is a production application to be exposed to the internet, and you can't reduce the number of queries you make, then you should at least reuse the answers. I would suggest using Django's built-in cache framework with memcached to keep database results in RAM. If this is a local app, then I would suggest Django's RAM-based (local-memory) cache. The reason is that memcached can be scaled a lot further than Django's local-memory cache, while the local-memory cache requires little setup.
Caching for Django
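One way to reuse the answers for the snapshot queries in the question is to put the date in the cache key, so a finished day is queried once and then served from the cache. A minimal sketch, assuming a hypothetical run_query callable that performs the actual database query, with a plain dict standing in for memcached:

```python
import datetime

_snapshot_cache = {}  # hypothetical in-process cache; memcached would play this role


def daily_summary(country_id, day, run_query):
    """Return the summary for a day, running the query only once for past days."""
    key = ("dailysummary", country_id, day.isoformat())
    if key in _snapshot_cache:
        return _snapshot_cache[key]
    result = run_query(country_id, day)
    # Only finished days are safe to keep; today's numbers may still change.
    if day < datetime.date.today():
        _snapshot_cache[key] = result
    return result
```

Since yesterday's and last month's figures no longer change, these entries never need explicit invalidation, which avoids the invalidation problem mentioned above for dynamic pages.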

How do I cache a list/dictionary in Pylons?

On a website I'm making, there's a section that hits the database pretty hard. Harder than I want. The data that's being retrieved is all very static. It will rarely change. So I want to cache it.
I came across http://wiki.pylonshq.com/display/pylonsdocs/Caching+in+Templates+and+Controllers, had a good read, and have been making use of template caching using:
return render('tmpl.html', cache_expire='never')
That works great until I modify the HTML. The only way I've found to delete the cache is to remove the cache_expire parameter from render() and delete the cache folder. But, meh, it works.
What I want to be able to do, however, is cache lists, tuples and dictionaries. From reading the above wiki page, it seems this isn't possible?
I want to be able to do something like:
data = [i for i in range(0, 2000000)]
mycache = cache.get_cache('cachename')
value = mycache.get(key='dataset1', list=data, type='memory', expiretime='3600')
print value
Allowing me to do some CPU intensive work (list generation, in this example) and then cache it.
Can this be done with Pylons?

As an alternative to a traditional cache, you can use app globals. Load the data into a variable once on server startup, and then use it in your actions or directly in templates.
http://pylonsbook.com/en/1.1/exploring-pylons.html#app-globals-object
You can also write an action that updates this global variable through the admin interface or in response to other events.
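The idea is roughly the following. This Globals class is a plain-Python, hypothetical stand-in rather than Pylons' own app globals object, and load_data is a placeholder for the expensive list generation from the question.

```python
class Globals:
    """Hypothetical stand-in for an application-wide globals object."""

    def __init__(self, load_data):
        self._load_data = load_data
        # Loaded once, when the application starts.
        self.dataset = load_data()

    def reload(self):
        """Call this from an admin action or another event when the data changes."""
        self.dataset = self._load_data()
```

For example, Globals(lambda: list(range(2000000))) would build the big list once; every request afterwards reads the dataset attribute without regenerating it.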

Why not use memcached?
Look at this question on SO on how to use it with pylons: Pylons and Memcached

Dynamically Created Top Articles List in Django?

I'm creating a Django-powered, newspaper-ish site. The least obvious and common-sense task that I have come across in getting the site together is how best to generate a "top articles" list for the sidebar of the page.
The first thing that came to mind was some sort of database column that is updated (based on what?) with every view. That seems (to my instincts) ridiculously database intensive and impractical and thus I think I'd like to find another solution.
Thanks all.

I would give Celery a try (with django-celery). While it's not as easy to configure and use as the cache, it enables you to queue tasks like incrementing counters and do them in the background. It could even be combined with the cache technique: in views, increment counters in the cache, and define a PeriodicTask that runs every now and then, resetting the counters and writing them to the database.
I just remembered: I once found this blog entry, which provides a nice way of incrementing a 'viewed_count' (or similar) column in the database with an AJAX JS call. If you don't have heavy traffic, maybe it's a good idea?
Also mentioned in this post is django-tracking, but I don't know much about it; I have never used it myself (yet).
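The "count in the cache, flush periodically" idea from this answer could look something like this. The two Counter objects are hypothetical stand-ins for the cache and for the database column, and flush_views is what the periodic task would call; no Celery or Django API is used here.

```python
from collections import Counter

pending_views = Counter()   # hypothetical stand-in for counters kept in the cache
stored_views = Counter()    # hypothetical stand-in for the view-count column in the database


def record_view(article_id):
    """Called from the view: cheap, no database write."""
    pending_views[article_id] += 1


def flush_views():
    """Called by the periodic task: write the accumulated counts, then reset them."""
    stored_views.update(pending_views)
    pending_views.clear()


def top_articles(limit):
    """Article ids ordered by stored view count, highest first."""
    return [article_id for article_id, _ in stored_views.most_common(limit)]
```

The database then sees one batch of writes per flush interval rather than one write per page view, at the cost of the top list lagging by up to one interval.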

This is premature optimization. First try the DB way and then see if it really is too database intensive. Any decent database has such good caches that it probably won't matter very much. And even if it is a problem, take a look at the other DB/cache suggestions here.
By the way, you will most likely have many more intensive DB queries with each page view than a simple view-count update.

If you do something like sort by top views, it would be fast if you index the view column in the DB. Another option is to only collect the top x articles every hour or so, and toss that value into Django's cache framework.
The nice thing about caching the list is that the algorithm you use to determine top articles can be as complex as you like without hitting the DB hard with every page view. Django's cache framework can use memory, db, or file system. I prefer DB, but many others prefer memory. I believe it uses pickle, so you can also store Python objects directly. It's easy to use, recommended.

An index wouldn't help, as the main problem, I believe, is not so much getting the sorted list as having a DB write with every page view of an article. Another index actually makes that problem worse, albeit only a little.
So I'd go with the cache. I think django's cache shim is a problem here because it requires timeouts on all keys. I'm not sure if that's imposed by memcached, if not then go with redis. Actually just go with redis anyway, the python library is great, I've used it from django projects before, and it has atomic increments and powerful sorting - everything you need.

Where is the best place to put cache-evicting logic in an AppEngine application?

I've written an application for Google AppEngine, and I'd like to make use of the memcache API to cut down on per-request CPU time. I've profiled the application and found that a large chunk of the CPU time is in template rendering and API calls to the datastore, and after chatting with a co-worker I jumped (perhaps a bit early?) to the conclusion that caching a chunk of a page's rendered HTML would cut down on the CPU time per request significantly. The caching pattern is pretty clean, but the question of where to put this logic of caching and evicting is a bit of a mystery to me.
For example, imagine an application's main page has an Announcements section. This section would need to be re-rendered after:
first read for anyone in the account,
a new announcement being added, and
an old announcement being deleted
Some options of where to put the evict_announcements_section_from_cache() method call:
in the Announcement Model's .delete(), and .put() methods
in the RequestHandler's .post() method
anywhere else?
Then in the RequestHandler's get page, I could potentially call get_announcements_section() which would follow the standard memcache pattern (check cache, add to cache on miss, return value) and pass that HTML down to the template for that chunk of the page.
Is it the typical design pattern to put the cache-evicting logic in the Model, or the Controller/RequestHandler, or somewhere else? Ideally I'd like to avoid having evicting logic with tentacles all over the code.

I've got just such a decorator up in an open source Github project:
http://github.com/jamslevy/gae_memoize/tree/master
It's a bit more in-depth, allowing for things like forcing execution of the function (when you want to refresh the cache) or forcing caching locally...these were just things that I needed in my app, so I baked them into my memoize decorator.

A couple of alternatives to regular eviction:
The obvious one: Don't evict, and set a timer instead. Even a really short one - a few seconds - can cut down on effort a huge amount for a popular app, without users even noticing data may be a few seconds stale.
Instead of evicting, generate the cache key based on criteria that change when the data does. For example, if retrieving the key of the most recent announcement is cheap, you could use that as part of the key of the cached data. When a new announcement is posted, you go looking for a key that doesn't exist, and create a new one as a result.
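The second alternative can be sketched like this. The cache dict is a hypothetical stand-in for memcache, and latest_announcement_key and render are assumed inputs: the cheap-to-fetch key of the newest announcement and a function that renders the section's HTML.

```python
cache = {}  # hypothetical stand-in for memcache


def announcements_html(latest_announcement_key, render):
    """Cache the rendered section under a key that changes whenever the data changes."""
    cache_key = "announcements:%s" % latest_announcement_key
    html = cache.get(cache_key)
    if html is None:
        html = render()
        cache[cache_key] = html
    return html
```

No eviction call is needed anywhere: posting a new announcement changes the key, so the next read misses and re-renders. Stale entries are simply never asked for again, and a real memcache would eventually drop them. Note that this particular key only changes when the newest announcement changes, so deleting an older announcement would need something more in the key (a count or a last-modified time, for example).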
