I have a list of entities which I want to store in the memcache. The
problem is that I have large Models referenced by their
ReferenceProperty which are automatically also stored in the memcache.
As a result I'm exceeding the size limit for objects stored in
memcache.
Is there any possibility to prevent the ReferenceProperties from
loading the referenced Models while putting them in memcache?
I tried something like
def __getstate__(self):
odict = self.__dict__.copy()
odict['model'] = None
return odict
in the class I want to store in memcache, but that doesn't seem to do
the trick.
Any suggestions would be highly appreciated.
Edit: I verified by adding a logging-statement that the __getstate__-Method is executed.
For large entities, you might want to manually handle the loading of the related entities by storing the keys of the large entities as something other than a ReferenceProperty. That way you can choose when to load the large entity and when not to. Just use a long property store ids or a string property to store keynames.
odict = self.copy()
del odict.model
would probably be better than using dict (unless getstate needs to return dict - i'm not familiar with it). Not sure if this solves Your problem, though... You could implement del in Model to test if it's freed. For me it looks like You still hold a reference somewhere.
Also check out the pickle module - you would have to store everything under a single key, but it automaticly protects You from multiple references to the same object (stores it only once). Sorry no link, mobile client ;)
Good luck!
Related
As a caveat: I am an utter novice here. I wouldn't be surprised to learn a) this is already answered, but I can't find it because I lack the vocabulary to describe my problem or b) my question is basically silly to begin with, because what I want to do is silly.
Is there some way to store a reference to a class instance that defined and stored in active memory and not stored in NDB? I'm trying to write an app that would help manage a number of characters/guilds in an MMO. I have a class, CharacterClass, that includes properties such as armor, name, etc. that I define in main.py as a base python object, and then define the properties for each of the classes in the game. Each Character, which would be stored in Datastore, would have a property charClass, which would be a reference to one of those instances of CharacterClass. In theory I would be able to do things like
if character.charClass.armor == "Cloth":
while storing the potentially hundreds of unique characters and their specifc data in Datastore, but without creating a copy of "Cloth" for every cloth-armor character, or querying Datastore for what kind of armor a mage wears thousands of times a day.
I don't know what kind of NDB property to use in Character to store the reference to the applicable CharacterClass. Or if that's the right way to do it, even. Thanks for taking the time to puzzle through my confused question.
A string is all you need. You just need to fetch the class based on the string value. You could create a custom property that automatically instantiates the class on reference.
However I have a feeling that hard coding the values in code might be a bit unwieldy. May be you character class instances should be datastore entities as well. It means you can adjust these parameters without deploying new code.
If you want these objects in memory then you can pre-cache them on warmup.
ReferenceProperty was very helpful in handling references between two modules. Fox example:
class UserProf(db.Model):
name = db.StringProperty(required=True)
class Team(db.Model):
manager_name = db.ReferenceProperty(UserProf, collection_name='teams')
name = db.StringProperty(required=True)
To get 'manager_name' with team instance, we use team_ins.manager_name.
To get 'teams' which are managed by particular user instance, we use user_instance.teams and iterate over.
Doesn't it look easy and understandable?
In doing same thing using NDB, we have to modify
db.ReferenceProperty(UserProf, collection_name='teams') --> ndb.KeyProperty(kind=UserProf)
team_ins.manager_name.get() would give you manager name
To get all team which are manger by particular user, we have to do
for team in Team.query(Team.manager_name == user_ins.key):
print "team name:", team.name
As you can see handling these kind of scenarios looks easier and readable in db than ndb.
What is the reason for removing ReferenceProperty in ndb?
Even db's query user_instance.teams would have doing the same thing as it is done in ndb's for loop. But in ndb, we are explicitly mentioning using for loop.
What is happening behind the scenes when we do user_instance.teams?
Thanks in advance..
Tim explained it well. We found that a common anti-pattern was using reference properties and loading them one at a time, because the notation "entity.property1.property2" doesn't make it clear that the first dot causes a database "get" operation. So we made it more obvious by forcing you to write "entity.property1.get().property2", and we made it easier to do batch prefetching (without the complex solution from Nick's blog) by simply saying "entity.property1.get_async()" for a bunch of entities -- this queues a single batch get operation without blocking for the result, and when you next reference any of these properties using "entity.property1.get().property2" this won't start another get operation but just waits for that batch get to complete (and the second time you do this, the batch get is already complete). Also this way in-process and memcache integration comes for free.
I don't know the answer as to why Guido didn't implement reference property.
However I found a spent a lot of time using pre_fetch_refprops http://blog.notdot.net/2010/01/ReferenceProperty-prefetching-in-App-Engine (pre fetches all of the reference properties by grabbing all the keys with get_value_for_datastore,) and then it does a get_multi on the keys.
This was vastly more efficient.
Also if the object referenced doesn't exist you would get an error when trying to fetch the object.
If you pickled an object which had references you ended up pickling a lot more than you probably planned too.
So I found except for the one case, where you have single entity and you wanted to grab the referenced object with .name type accessor you had to jump through all sorts of hoops to prevent the referenced entity from being fetched.
I have a memory issue with mongoengine (in python).
Let's say I have a very large amount of custom_documents (several thousands).
I want to process them all, like this:
for item in custom_documents.objects():
process(item)
The problem is custom_documents.objects() load every objects in memory and my app use several GB ...
How can I do to make it more memory wise?
Is there a way to make mongoengine to query the DB lazily (it request objects when we iterates on the queryset)?
According to the docs (and in my experience), collection.objects returns a lazy QuerySet. Your first problem might be that you're calling the objects attribute, rather than just using it as an iterable. I feel like there must be some other reason your app is using so much memory, perhaps process(object) stores a reference to it somehow? Try the following code and check your app's memory usage:
queryset = custom_documents.objects
print queryset.count()
Since QuerySets are lazy, you can do things like custom_documents.limit(100).skip(500) as well in order to return objects 500-600 only.
I think you want to look at querysets - these are the MongoEngine wrapper for cursors:
http://mongoengine.org/docs/v0.4/apireference.html#querying
They let you control the number of objects returned, essentially taking care of the batch size settings etc. that you can set directly in the pymongo driver:
http://api.mongodb.org/python/current/api/pymongo/cursor.html
Cursors are set up to generally behave this way by default, you have to try to get them to return everything in one shot, even in the native mongodb shell.
I have a need for some kind of information that is in essence static. There is not much of this information, but alot of objects will use that information.
Since there is not a lot of that information (few dictionaries and some lists), I thought that I have 2 options - create models for holding that information in the database or write them as dictionaries/lists to some settings file. My question is - which is faster, to read that information from the database or from a settings file? In either case I need to be able to access that information in lot of places, which would mean alot of database read calls. So which would be faster?
If they're truly never, ever going to change, then feel free to put them in your settings.py file as you would declare a normal Python dictionary.
However, if you want your information to be modifiable through the normal Django methods, then use the database for persistent storage, and then make the most of Django's cache framework.
Save your data to the database as normal, and then the first time it is accessed, cache them:
from django.core.cache import cache
def some_view_that_accesses_date(request):
my_data = cache.get('some_key')
if my_data is None:
my_data = MyObject.objects.all()
cache.set('some_key', my_data)
... snip ... normal view code
Make sure never to save None in a cache, as:
We advise against storing the literal
value None in the cache, because you
won't be able to distinguish between
your stored None value and a cache
miss signified by a return value of
None.
Make sure you kill the cache on object deletion or change:
from django.core.cache import cache
from django.db.models.signals import post_save
from myapp.models import MyModel
def kill_object_cache(sender, **kwargs):
cache.delete('some_key')
post_save.connect(kill_object_cache, sender=MyModel)
post_delete.connect(kill_object_cache, sender=MyModel)
I've got something similar to this in one of my apps, and it works great. Obviously you won't see any performance improvements if you then go and use the database backend, but this is a more Django-like (Djangonic?) approach than using memcached directly.
Obviously it's probably worth defining the cache key some_key somewhere, rather than littering it all over your code, the examples above are just intended to be easy to follow, rather than necessarily full-blown implementations of caching.
If the data is static, there is no need to keep going back to the database. Just read it the first time it is required and cache the result.
If there is some reason you can't cache the result in your app, you can always use memcached to avoid hitting the database.
The advantage of using memcached is that if the data does change, you can simply update the value in memcached.
Pseudocode for using memcached
if 'foo' in memcached
data = memcached.get('foo')
else
data = database.get('foo')
memcached.put('foo', data)
If you need fast access from multiple processes, then a database is the best option for you.
However, if you just want to keep data in memory and access it from multiple places in the same process, then Python dictionaries will be faster than accessing a DB.
How can I cache a Reference Property in Google App Engine?
For example, let's say I have the following models:
class Many(db.Model):
few = db.ReferenceProperty(Few)
class Few(db.Model):
year = db.IntegerProperty()
Then I create many Many's that point to only one Few:
one_few = Few.get_or_insert(year=2009)
Many.get_or_insert(few=one_few)
Many.get_or_insert(few=one_few)
Many.get_or_insert(few=one_few)
Many.get_or_insert(few=one_few)
Many.get_or_insert(few=one_few)
Many.get_or_insert(few=one_few)
Now, if I want to iterate over all the Many's, reading their few value, I would do this:
for many in Many.all().fetch(1000):
print "%s" % many.few.year
The question is:
Will each access to many.few trigger a database lookup?
If yes, is it possible to cache somewhere, as only one lookup should be enough to bring the same entity every time?
As noted in one comment: I know about memcache, but I'm not sure how I can "inject it" when I'm calling the other entity through a reference.
In any case memcache wouldn't be useful, as I need caching within an execution, not between them. Using memcache wouldn't help optimizing this call.
The first time you dereference any reference property, the entity is fetched - even if you'd previously fetched the same entity associated with a different reference property. This involves a datastore get operation, which isn't as expensive as a query, but is still worth avoiding if you can.
There's a good module that adds seamless caching of entities available here. It works at a lower level of the datastore, and will cache all datastore gets, not just dereferencing ReferenceProperties.
If you want to resolve a bunch of reference properties at once, there's another way: You can retrieve all the keys and fetch the entities in a single round trip, like so:
keys = [MyModel.ref.get_value_for_datastore(x) for x in referers]
referees = db.get(keys)
Finally, I've written a library that monkeypatches the db module to locally cache entities on a per-request basis (no memcache involved). It's available, here. One warning, though: It's got unit tests, but it's not widely used, so it could be broken.
The question is:
Will each access to many.few trigger a database lookup? Yes. Not sure if its 1 or 2 calls
If yes, is it possible to cache somewhere, as only one lookup should be enough to bring the same entity every time? You should be able to use the memcache repository to do this. This is in the google.appengine.api.memcache package.
Details for memcache are in http://code.google.com/appengine/docs/python/memcache/usingmemcache.html