I want to use caching in Django and I am stuck up with how to go about it. I have data in some specific models which are write intensive. records will get added continuously to the model. Each user has some specific data in the model similar to orders table.
Since my model is write intensive I am not sure how effective caching frameworks in Django are going to be. I tried Django view specific caching and I am try to develop a view where first it will pick up data from the cache. Then I will have another call which will bring in data which was added to the model after the caching was done. What I want to do is add the updated data to the original cache data and store it again.
It is like I don't want to expire my cache, I just want to keep adding to my existing cache data. may be once in 3 hrs I can clear it.
Is what I am doing right. Are there better ways than this. Can I really add to items in existing cache.
I will be very glad for your help
You ask about "caching" which is a really broad topic, and the answer is always a mix of opinion, style and the specific app requirements. Here are a few points to consider.
If the data is per user, you can cache it per user:
from django.core.cache import cache
cache.set(request.user.id,"foo")
cache.get(request.user.id)
The common practice it to keep a database flag that tells you if the user's data changed since it was cached. So before you fetch the data from cache, check only this flag from the DB. If the flag says nothing changed, get the data from cache. If it did change, pull from DB, replace the cache, and set the flag again.
The flag check should be fast and simple: one table, indexed by user.id, and a boolean flag field. This will squeeze a lot of index rows into a single DB page, and enables a fast fetching of a single one field row. Yet you still get a persistent updated main storage, that prevents the use of not updated cache data. You can check this flag in a middleware.
You can run expiry in many ways: clear cache when user logs out, run a cron script that clears items, or let the cache backend expire items. If you use a flag check before you use the cache, there is no issue in keeping items in cache except space, and caching backends handle that. If you use the django simple file cache (which is easy, simple and zero config), you will have to clear the cache. A simple cron script will do.
Related
My view displays a table of data (specific to a customer who may have many users) and this table takes up a lot of computational resource to populate. A customers data changes 4/5 times a week, usually on the same day.
Caching is an obvious solution to this but I was wondering if Django's cache framework is significantly more efficient than creating a Textfield at the customer level and storing the data there instead?
I feel it's easier to implement (and clear the Textfield when the data changes) but what are the drawbacks and is there anything else I need to look out for? (problems if the dataset gets too big? additional fields in the model etc etc???)
Any help would be much appreciated!
A cache is a cache is cache, however you implement it, and the main problem with caches is invalidation.
As Melvyn rightly answered, the case for the cache framework is that it's (well, can be, depending on which backend you choose) outside your database. Whether it's a pro or cons really depends on your database load, infrastructure and whatnots... if you already use the cache framework (for more than plain unconditional full-page caching I mean) and want to mimimize the load on your database then it's possibly worth the added complexity.
Else storing your computed result in the db is quite straightforward and doesn't require additional servers, install etc. I'd personnally go for a dedicated model - to avoid unnecessary overhead at the db level -, including both the cached result and a checksum of the params on which this result depends (canonical memoization pattern) so you can easily detect whether it needs to be recomputed. I found this solution to be easier to maintain than trying to detect changes to each and any of those params and invalid/recompute the cache "on the fly" (which is what can make proper cache invalidation difficult or at least complex to implement) but this again depends on what those params are and where they come from.
The upside to using the cache framework is that you don't have to use the database. You can scale your cache store independent of your database and run the cache on different physical (or virtual) machines.
In addition you don't have to implement the stale vs fresh logic, but that's a one-off.
4-5 times a week doesn't look like a big challenge, but nobody knows except you what kind of computation do you have, how many data you should store, how many users do you have and so on.
If you want to implement this with TextField, it still some kind of caching system, so I suggest to use django's caching system with database backend first https://docs.djangoproject.com/en/1.11/topics/cache/#database-caching You can't retrieve data with 1 query like in case of TextField, but later you can replace database with other layer if necessary.
i have written MicroServices like for auth, location, etc.
All of microservices have different database, with for eg location is there in all my databases for these services.When in any of my project i need a location of user, it first looks in cache, if not found it hits the database. So far so good.Now when location is changed in any of my different databases, i need to update it in other databases as well as update my cache.
currently i made a model (called subscription) with url as its field, whenever a location is changed in any database, an object is created of this subscription. A periodic task is running which checks for subscription model, when it finds such objects it hits api of other services and updates location and updates the cache.
I am wondering if there is any better way to do this?
I am wondering if there is any better way to do this?
"better" is entirely subjective. if it meets your needs, it's fine.
something to consider, though: don't store the same information in more than one place.
if you need an address, look it up from the service that provides address, every time.
this may be a performance hit, but it eliminates the problem of replicating the data everywhere.
another option would be a more proactive approach, as suggested in comments.
instead of creating a task list for changes, and doing that periodically, send a message across rabbitmq immediately when the change happens. let every service that needs to know, get a copy of the message and update it's own cache of info.
just remember, though. every time you have more than one copy of the information, you reduce the "correctness" of the system, as a whole. it will always be possible for the information found in one of your apps to be out of date, because it did not get an update from the official source.
Summary: is there a race condition in Django sessions, and how do I prevent it?
I have an interesting problem with Django sessions which I think involves a race condition due to simultaneous requests by the same user.
It has occured in a script for uploading several files at the same time, being tested on localhost. I think this makes simultaneous requests from the same user quite likely (low response times due to localhost, long requests due to file uploads). It's still possible for normal requests outside localhost though, just less likely.
I am sending several (file post) requests that I think do this:
Django automatically retrieves the user's session*
Unrelated code that takes some time
Get request.session['files'] (a dictionary)
Append data about the current file to the dictionary
Store the dictionary in request.session['files'] again
Check that it has indeed been stored
More unrelated code that takes time
Django automatically stores the user's session
Here the check at 6. will indicate that the information has indeed been stored in the session. However, future requests indicate that sometimes it has, sometimes it has not.
What I think is happening is that two of these requests (A and B) happen simultaneously. Request A retrieves request.session['files'] first, then B does the same, changes it and stores it. When A finally finishes, it overwrites the session changes by B.
Two questions:
Is this indeed what is happening? Is the django development server multithreaded? On Google I'm finding pages about making it multithreaded, suggesting that by default it is not? Otherwise, what could be the problem?
If this race condition is the problem, what would be the best way to solve it? It's an inconvenience but not a security concern, so I'd already be happy if the chance can be decreased significantly.
Retrieving the session data right before the changes and saving it right after should decrease the chance significantly I think. However I have not found a way to do this for the request.session, only working around it using django.contrib.sessions.backends.db.SessionStore. However I figure that if I change it that way, Django will just overwrite it with request.session at the end of the request.
So I need a request.session.reload() and request.session.commit(), basically.
Yes, it is possible for a request to start before another has finished. You can check this by printing something at the start and end of a view and launch a bunch of request at the same time.
Indeed the session is loaded before the view and saved after the view. You can reload the session using request.session = engine.SessionStore(session_key) and save it using request.session.save().
Reloading the session however does discard any data added to the session before that (in the view or before it). Saving before reloading would destroy the point of loading late. A better way would be to save the files to the database as a new model.
The essence of the answer is in the discussion of Thomas' answer, which was incomplete so I've posted the complete answer.
Mark just nailed it, only minor addition from me, is how to load that session:
for key in session.keys(): # if you have potential removals
del session[key]
session.update(session.load())
session.modified = False # just making it clean
First line optional, you only need it if certain values might be removed meanwhile from the session.
Last line is optional, if you update the session, then it does not really matter.
That is true. You can confirm it by having a look at the django.contrib.sessions.middleware.SessionMiddleware.
Basically, request.session is loaded before request hits your view (in process_request), and it is updated in the session backend (if needed) after the response has left your view (in process_response).
If what I mean is unclear, you might want to have a look at the django documentation for Middleware.
The best way to solve the issue will depend on what you're trying to achieve with that information. I'll update my answer if you provide that information!
I'm creating a Django-powered site for my newspaper-ish site. The least obvious and common-sense task that I have come across in getting the site together is how best to generate a "top articles" list for the sidebar of the page.
The first thing that came to mind was some sort of database column that is updated (based on what?) with every view. That seems (to my instincts) ridiculously database intensive and impractical and thus I think I'd like to find another solution.
Thanks all.
I would give celery a try (with django-celery). While it's not so easy to configure and use as cache, it enables you to queue tasks like incrementing counters and do them in background. It could be even combined with cache technique - in views increment counters in cache and define PeriodicTask that will run every now and then, resetting counters and writing them to the database.
I just remembered - I once found this blog entry which provides nice way of incrementing 'viewed_count' (or similar) column in database with AJAX JS call. If you don't have heavy traffic maybe it's good idea?
Also mentioned in this post is django-tracking, but I don't know much about it, I never used it myself (yet).
Premature optimization, first try the db way and then see if it really is too database sensitive. Any decent database has so good caches it probably won't matter very much. And even if it is a problem, take a look at the other db/cache suggestions here.
It is most likely by the way is that you will have many more intensive db queries with each view than a simple view update.
If you do something like sort by top views, it would be fast if you index the view column in the DB. Another option is to only collect the top x articles every hour or so, and toss that value into Django's cache framework.
The nice thing about caching the list is that the algorithm you use to determine top articles can be as complex as you like without hitting the DB hard with every page view. Django's cache framework can use memory, db, or file system. I prefer DB, but many others prefer memory. I believe it uses pickle, so you can also store Python objects directly. It's easy to use, recommended.
An index wouldn't help as them main problem I believe is not so much getting the sorted list as having a DB write with every page view of an article. Another index actually makes that problem worse, albeit only a little.
So I'd go with the cache. I think django's cache shim is a problem here because it requires timeouts on all keys. I'm not sure if that's imposed by memcached, if not then go with redis. Actually just go with redis anyway, the python library is great, I've used it from django projects before, and it has atomic increments and powerful sorting - everything you need.
I want to load info from another site (this part is done), but i am doing this every time the page is loaded and that wont do. So i was thinking of having a variable in a table of settings like 'last checked bbc site' and when the page loads it would check if its been long enough since last check to check again. Is there anything silly about doing it that way?
Also do i absolutely have to use tables to store 1 off variables like this setting?
I think there are 2 options that would work for you, besides creating a entity in the datastore to keep track of "last visited time".
One way is to just check the external page periodically, using the cron api as described by jldupont.
The second way is to store the last visited time in memcache. Although memcache is not permanent, it doesn't have to be if you are only storing last refresh times. If your entry in memcache were to disappear for some reason, the worst that would happen would be that you would fetch the page again, and update memcache with the current date/time.
The first way would be best if you want to check the external page at regular intervals. The second way might be better if you want to check the external page only when a user clicks on your page, and you haven't fetched that page yourself in the recent past. With this method, you aren't wasting resources fetching the external page unless someone is actually looking for data related to it.
You could also use Scheduled Tasks.
Also, you don't absolutely need to use the Datastore for configuration parameters: you could have this in a script / config file.
If you want some handler on your GAE app (including one for a scheduled task, reception of messages, web page visits, etc) to store some new information in such a way that some handler in the future can recover that information, then GAE's storage is the only good general way (memcache could expire from under you, for example). Not sure what you mean by "tables" (?!), but guessing that you actually mean GAE's storage the answer is "yes". (Under very specific circumstances you might want to put that data to some different place on the network, such as your visitor's browser e.g. via cookies, or an Amazon storage instance, etc, but it does not appear to me that those specific circumstances are appliable to your use case).