Trying to understand what happens during a Django low-level cache.set().
Particularly, details about what part of the queryset gets stored in memcached.
First, am I interpreting the Django docs correctly?
a queryset (Python object) has/maintains its own cache
access to the database is lazy; even if the queryset would match 1000 records,
if I do a .get() for 1 record, then the database will only be
accessed once, for that 1 record.
when accessing a Django view via the Apache prefork MPM, every time that
a particular daemon instance X ends up invoking a particular view that includes something
like "tournres_qset = TournamentResult.objects.all()",
this will then result, each time, in a new tournres_qset object
being created. That is, anything that may have been cached internally
by a tournres_qset Python object from a previous (TCP/IP) visit
is not used at all by a new request's tournres_qset.
Now the questions about saving things to memcached within the view.
Let's say I add something like this at the top of the view:
tournres_qset = cache.get('tournres', None)
if tournres_qset is None:
    tournres_qset = TournamentResult.objects.all()
    cache.set('tournres', tournres_qset, timeout)
# now start accessing tournres_qset
# ...
What gets stored during the cache.set()?
Does the whole queryset (python object) get serialized and saved?
Since the queryset hasn't been used yet to get any records, is this
just a waste of time, since no particular records' contents are actually
being saved in memcache? (Any future requests will get the queryset
object from memcache, which will always start fresh, with an empty local
queryset cache; access to the database will always occur.)
If the above is true, then should I just always re-save the queryset
at the end of the view, after it's been used throughout the view to access
some records, which will result in the queryset's local cache getting updated,
and which should always get re-saved to memcached? But then, this would always
result in once again serializing the queryset object.
So much for speeding things up.
Or, does the cache.set() force the queryset object to iterate and
access all the records from the database, which will also get saved in
memcache? Everything would get saved, even if the view only accesses
a subset of the queryset?
I see pitfalls in all directions, which makes me think that I'm
misunderstanding a whole bunch of things.
Hope this makes sense and appreciate clarifications or pointers to some
"standard" guidelines. Thanks.
Querysets are lazy, which means they don't call the database until they're evaluated. One way they could get evaluated would be to serialize them, which is what cache.set does behind the scenes. So no, this isn't a waste of time: the entire contents of your TournamentResult model will be cached, if that's what you want. It probably isn't: if you filter the queryset further, Django will just go back to the database, which would make the whole thing a bit pointless. You should just cache the model instances you actually need.
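For instance, a minimal sketch of caching the evaluated instances rather than the lazy queryset (the cache key and model follow the question; the view name and timeout are made up):

from django.core.cache import cache

def tournament_view(request):
    results = cache.get('tournres')
    if results is None:
        # list() forces evaluation: the query runs once, and the resulting
        # model instances are what actually get pickled into memcached
        results = list(TournamentResult.objects.all())
        cache.set('tournres', results, 300)  # timeout in seconds
    # work with the plain list of instances from here on
    ...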
Note that the third point in your initial set isn't quite right, in that this has nothing to do with Apache or preforking. It's simply that a view is a function like any other, and anything defined in a local variable inside a function goes out of scope when that function returns. So a queryset defined and evaluated inside a view goes out of scope when the view returns the response, and a new one will be created the next time the view is called, ie on the next request. This is the case whichever way you are serving Django.
However, and this is important, if you do something like set your queryset to a global (module-level) variable, it will persist between requests. Most of the ways that Django is served, and this definitely includes mod_wsgi, keep a process alive for many requests before recycling it, so the value of the queryset will be the same for all of those requests. This can be useful as a sort of bargain-basement cache, but is difficult to get right because you have no idea how long the process will last, plus other processes are likely to be running in parallel which have their own versions of that global variable.
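For illustration, a hypothetical sketch of that pattern (all names are made up):

_tournres_cache = None

def my_view(request):
    global _tournres_cache
    if _tournres_cache is None:
        # runs once per process, then persists until the process is recycled;
        # each worker process keeps its own independent copy
        _tournres_cache = list(TournamentResult.objects.all())
    ...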
Updated to answer questions in the comment
Your questions show that you still haven't quite grokked how querysets work. It's all about when they are evaluated: if you list, or iterate, or slice a queryset, that evaluates it, and it's at that point the database call is made (I count serialization under iterating, here), and the results stored in the queryset's internal cache. So, if you've already done one of those things to your queryset, and then set it to the (external) cache, that won't cause another database hit.
But, every filter() operation on a queryset, even one that's already evaluated, is another database hit. That's because it's a modification of the underlying SQL query, so Django goes back to the database - and returns a new queryset, with its own internal cache.
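A small sketch of the difference (the rank field is hypothetical):

qs = TournamentResult.objects.all()
results = list(qs)           # evaluates: one query, rows cached on qs internally
for r in qs:                 # no new query: served from qs's internal cache
    pass
winners = qs.filter(rank=1)  # new queryset with its own empty cache: another
                             # database hit once winners is evaluated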
Related
I don't know where else I could have asked this question, so I'm asking it here. I want to know: if I impose multiple Django filters on a page, drawing on multiple database tables, will that affect RAM consumption whenever a user visits the page, given that only the filtered data is shown to the user? I'm using Django with PostgreSQL on an Ubuntu-based VM. Also, if there is any documentation that can be helpful in understanding RAM utilization, please suggest it.
Django filters and querysets are lazy. What this actually means is that you are not actually hitting the database until you evaluate them. Quoting the official documentation:
Internally, a QuerySet can be constructed, filtered, sliced, and generally passed around without actually hitting the database. No database activity actually occurs until you do something to evaluate the queryset.
So the only space taken up in your RAM is the queryset object itself and your program. It is when the query is evaluated and data is extracted from the database that memory fills up, depending on how much data is extracted. It would also be a good idea to look at iterators.
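For example (model and field names are illustrative):

qs = Entry.objects.filter(status='published')  # no database activity yet
qs = qs.exclude(author__isnull=True)           # still lazy: only builds up the SQL
page = qs[:10]                                 # slicing without a step is lazy too
rows = list(page)                              # evaluated here; only now does memory fill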
I want to iterate over all the objects in a QuerySet. However, the QuerySet matches hundreds of thousands if not millions of objects. So when I try to start an iteration, my CPU usage goes to 100%, and all my memory fills up, and then things crash. This happens before the first item is returned:
bts = Backtrace.objects.all()
for bt in bts:
    print(bt)
I can ask for an individual object and it returns immediately:
bts = Backtrace.objects.all()
print(bts[5])
But getting a count of all objects crashes just as above, so I can't iterate using this method since I don't know how many objects there will be.
What's a way to iterate without causing the whole result to get pre-cached?
First of all make sure that you understand when a queryset is evaluated (hits the database).
Of course, one approach would be to filter the queryset.
If this is out of the question there are some workarounds you can use, according to your needs:
Queryset Iterator
Pagination
A django snippet that claims to do the job using a generator
Raw SQL using cursors.
Fetch a list of necessary values or a dictionary of necessary values and work with them.
Here is a nice article that tries to tackle this issue at a more theoretical level.
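As a rough sketch of the first two workarounds, reusing the Backtrace model from the question (process() is a placeholder):

# 1. iterator(): stream results without filling the queryset's internal cache
for bt in Backtrace.objects.all().iterator():
    process(bt)

# 2. Keyset pagination: fetch fixed-size batches ordered by primary key
last_pk = 0
while True:
    batch = list(Backtrace.objects.filter(pk__gt=last_pk).order_by('pk')[:1000])
    if not batch:
        break
    for bt in batch:
        process(bt)
    last_pk = batch[-1].pk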
In the SQLAlchemy docs it says this:
"When using a Session, it’s important to note that the objects which are associated with it are proxy objects to the transaction being held by the Session - there are a variety of events that will cause objects to re-access the database in order to keep synchronized. It is possible to “detach” objects from a Session, and to continue using them, though this practice has its caveats. It’s intended that usually, you’d re-associate detached objects with another Session when you want to work with them again, so that they can resume their normal task of representing database state."
[http://docs.sqlalchemy.org/en/rel_0_9/orm/session.html]
If I am in the middle of a session in which I read some objects, do some manipulations and more queries, and save some objects before committing, is there a risk that changes to the database by other users will unexpectedly update my objects while I am working with them?
In other words, what are the "variety of events" referred to above?
Is the answer to set the transaction isolation level to maximum? (I am using PostgreSQL with Flask-SQLAlchemy and Flask-Restful, if any of that matters.)
No, SQLAlchemy does not monitor the database for changes or update your objects whenever it feels like it; I imagine that would be quite an expensive operation. The "variety of events" refers more to SQLAlchemy's internal state. I'm not familiar with all the "events", but for example, when objects are marked as expired, SQLAlchemy automatically reloads them from the database. One such case is calling session.commit() and accessing any object's property again.
More here: Documentation about expiring objects
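A short sketch of the commit case (the model and connection string are illustrative; expire_on_commit defaults to True):

from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class User(Base):
    __tablename__ = 'users'
    id = Column(Integer, primary_key=True)
    name = Column(String(50))

engine = create_engine('postgresql://user:pass@localhost/mydb')
Session = sessionmaker(bind=engine)  # expire_on_commit=True is the default
session = Session()

user = session.query(User).get(1)  # loaded from the database
session.commit()                   # every instance in the session is now expired
print(user.name)                   # this attribute access emits a fresh SELECT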
I have a problem with SQLAlchemy - my app runs as a constantly working Python application.
I have function like this:
def myFunction(self, param1):
    s = select([statsModel.c.STA_ID, statsModel.c.STA_DATE])\
        .select_from(statsModel)
    statsResult = self.connection.execute(s).fetchall()
    return {'result': statsResult, 'calculation': param1}
I think this is a clear example - one result set is fetched from the database, the second is just passed as an argument.
The problem is that when I change data in my database, this function still returns data like nothing was changed. When I change data in input parameter, returned parameter "calculation" has proper value.
When I restart the app server, situation comes back to normal - new data are fetched from MySQL.
I know that there were several questions about SQLAlchemy caching like:
How to disable caching correctly in Sqlalchemy orm session?
How to disable SQLAlchemy caching?
but how else can I describe this situation? It seems SQLAlchemy keeps the data fetched before and does not perform new queries until the application restarts. How can I avoid such behavior?
Calling session.expire_all() will evict all database-loaded data from the session. Any subsequent access of object attributes emits a new SELECT statement and gets new data back. Please see http://docs.sqlalchemy.org/en/latest/orm/session_state_management.html#refreshing-expiring for background.
If you still see so-called "caching" after calling expire_all(), then you need to close out transactions as described in my answer linked above.
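As a rough sketch for a long-running application like yours (session, Stats, and handle() are assumed to exist already):

import time

while True:
    session.rollback()                  # end the previous transaction so the
                                        # next query sees a fresh snapshot
    session.expire_all()                # evict already-loaded, possibly stale state
    stats = session.query(Stats).all()  # emits new SELECT statements
    handle(stats)
    time.sleep(60)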
A few possibilities.
You are reusing your session improperly or at improper time. Best practice is to throw away your session after you commit, and get a new one at the last possible moment before you use it. The behavior that appears to be caching may in fact be due to a session lifetime being very long in your application.
Objects that survive longer than the session are not being merged into a subsequent session. SQLAlchemy may not be able to update their state if you do not merge them back in. This is more a concern for the ORM API of SQLAlchemy, which you do not appear so far to be using.
Your changes are not committed. You say they are so we'll assume this is not it, but if none of the other avenues explain it you may want to look again.
One general debugging tip: if you want to know exactly what SQLAlchemy is doing in the database, pass echo=True to the create_engine function. The engine will print all queries it runs.
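For example (the connection string is illustrative):

from sqlalchemy import create_engine

# echo=True logs every SQL statement the engine emits, making it easy to
# see whether a query is actually being sent to the database
engine = create_engine('mysql://user:pass@localhost/mydb', echo=True)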
Also check out this suggestion I made to someone else, who was using ORM and had transactionality problems, which resolved their issue without ever pinpointing it. Maybe it will help you.
You need to change the transaction isolation level to READ COMMITTED.
http://docs.sqlalchemy.org/en/rel_0_9/dialects/mysql.html#mysql-isolation-level
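A sketch, with an illustrative connection string:

from sqlalchemy import create_engine

# the MySQL dialect accepts isolation_level at engine creation time
engine = create_engine(
    'mysql://user:pass@localhost/mydb',
    isolation_level='READ COMMITTED',
)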
How can I cache a Reference Property in Google App Engine?
For example, let's say I have the following models:
class Few(db.Model):
    year = db.IntegerProperty()

class Many(db.Model):
    few = db.ReferenceProperty(Few)
Then I create many Many's that point to only one Few:
# get_or_insert() requires a key_name as its first argument
one_few = Few.get_or_insert('2009', year=2009)
Many.get_or_insert('many1', few=one_few)
Many.get_or_insert('many2', few=one_few)
Many.get_or_insert('many3', few=one_few)
Many.get_or_insert('many4', few=one_few)
Many.get_or_insert('many5', few=one_few)
Many.get_or_insert('many6', few=one_few)
Now, if I want to iterate over all the Many's, reading their few value, I would do this:
for many in Many.all().fetch(1000):
    print("%s" % many.few.year)
The question is:
Will each access to many.few trigger a database lookup?
If yes, is it possible to cache somewhere, as only one lookup should be enough to bring the same entity every time?
As noted in one comment: I know about memcache, but I'm not sure how I can "inject it" when I'm calling the other entity through a reference.
In any case memcache wouldn't be useful, as I need caching within an execution, not between them. Using memcache wouldn't help optimizing this call.
The first time you dereference any reference property, the entity is fetched - even if you'd previously fetched the same entity associated with a different reference property. This involves a datastore get operation, which isn't as expensive as a query, but is still worth avoiding if you can.
There's a good module that adds seamless caching of entities available here. It works at a lower level of the datastore, and will cache all datastore gets, not just dereferencing ReferenceProperties.
If you want to resolve a bunch of reference properties at once, there's another way: You can retrieve all the keys and fetch the entities in a single round trip, like so:
# collect the referenced keys without dereferencing them (no datastore gets yet)
keys = [MyModel.ref.get_value_for_datastore(x) for x in referers]
# then fetch all the referenced entities in a single batch round trip
referees = db.get(keys)
Finally, I've written a library that monkeypatches the db module to locally cache entities on a per-request basis (no memcache involved). It's available here. One warning, though: it's got unit tests, but it's not widely used, so it could be broken.
The question is:
Will each access to many.few trigger a database lookup? Yes, although I'm not sure whether it's 1 or 2 calls.
If yes, is it possible to cache somewhere, as only one lookup should be enough to bring the same entity every time? You should be able to use the memcache service to do this. It is in the google.appengine.api.memcache package.
Details for memcache are in http://code.google.com/appengine/docs/python/memcache/usingmemcache.html
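A rough sketch of that approach, reusing the get_value_for_datastore() trick from the other answer (the helper name and timeout are made up):

from google.appengine.api import memcache
from google.appengine.ext import db

def get_few(many):
    # read the reference's key without dereferencing it (no datastore get)
    few_key = Many.few.get_value_for_datastore(many)
    few = memcache.get(str(few_key))
    if few is None:
        few = db.get(few_key)                  # one datastore get on a miss
        memcache.set(str(few_key), few, 3600)  # cache for an hour
    return few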