My App Engine application is going to have "articles" and "tags", and there are two techniques to implement that (thanks to Nick Johnson's article):
# one entity just refers to the others
class Article(db.Model):
    tags = db.ListProperty(db.Key)  # keys of Tag entities (ListProperty can't hold model instances)
# via a separate "join" table
class ArticlesAndTags(db.Model):
    article = db.ReferenceProperty(Article)
    tag = db.ReferenceProperty(Tag)
Which one should I prefer according to the following tasks?
Create tag cloud (frequently),
Select articles by a tag (rather rarely)
Because App Engine's MapReduce lacks a 'reduce' phase (and the datastore has no SQL-style GROUP BY query), tag clouds are tricky to implement efficiently: you need to count all your tags manually. Whichever implementation you go with, what I would suggest for the tag cloud is a separate model, TagCounter, that keeps track of how many of each tag you have. Otherwise the tag query could get expensive if you have a lot of them.
class TagCounter(db.Model):
    tag = db.ReferenceProperty(Tag)
    counter = db.IntegerProperty(default=0)
Every time you choose to update your tags on an article, make sure you increment and decrement from this table accordingly.
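The increment/decrement bookkeeping boils down to set arithmetic on the article's old and new tag lists. A minimal sketch of just that step, with the datastore writes omitted and the function name my own invention:

```python
def tag_counter_deltas(old_tags, new_tags):
    """Given an article's tags before and after an edit, return
    (tags whose counter to increment, tags whose counter to decrement)."""
    old, new = set(old_tags), set(new_tags)
    return new - old, old - new
```

You would then apply each delta to the corresponding TagCounter entity, ideally in the same transaction as the article update.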
As for selecting articles by a tag, the first implementation is sufficient (the second is overly complex imo).
class Article(db.Model):
    tags = db.ListProperty(db.Key)  # keys of Tag entities

    @staticmethod
    def select_by_tag(tag):
        return Article.all().filter("tags =", tag.key()).run()
I created a huge tag cloud* on GAEcupboard, opting for the first solution:
class Post(db.Model):
    title = db.StringProperty(required=True)
    tags = db.ListProperty(str, required=True)
The Tag class has a counter property that is updated each time a post is created/updated/deleted.
class Tag(db.Model):
    name = db.StringProperty(required=True)
    counter = db.IntegerProperty(required=True)
    last_modified = db.DateTimeProperty(required=True, auto_now=True)
Having the tags organized in a ListProperty, it's quite easy to offer a drill-down feature that allows the user to combine different tags to search for the desired articles:
Example:
http://www.gaecupboard.com/tag/python/web-frameworks
The search is done using:
posts = Post.all()
posts.filter('tags =', 'python').filter('tags =', 'web-frameworks')
posts.fetch(20)  # fetch() requires a limit
that does not need any custom index at all.
ok, it's too huge, I know :)
Creating a tag cloud in App Engine is really difficult, because the datastore supports neither the GROUP BY construct normally used to express it, nor a way to order by the length of a list property.
One of the key insights is that you have to show a tag cloud frequently, but you only have to create one when there are new articles or articles get retagged, since you'll get the same tag cloud in any case. In fact, the tag cloud doesn't change very much with each new article: maybe a tag in the cloud becomes a little larger or a little smaller, but not by much, and not in a way that would affect its usefulness.
This suggests that tag clouds should be created periodically, cached, and served much like static content. You should think about doing that with the Task Queue API.
The other query, listing articles by tag, would be utterly unsupported by the first technique you've shown. Inverting it, a Tag model with an articles ListProperty, does support the query, but will suffer from update contention when popular tags have to be appended to at a high rate. The other technique, using an association model, suffers from neither of these concerns, but makes the article-listing queries less convenient.
The way I would deal with this is to start with the ArticlesAndTags model, but add some additional data to the model to have a useful ordering; an article date, article title, whatever makes sense for the particular kind of site you're making. You'll also need a monotonic sequence (say, a timestamp) on this so you know when the tag applied.
The tag cloud query would be supported by a Tag entity that has only a numeric article count, plus a reference to the same timestamp used in the ArticlesAndTags model.
A task queue can then query for the 1000 oldest ArticlesAndTags entities that are newer than the oldest Tag, sum the frequency of each tag, and add it to the counts in the Tags. Tag removals are probably rare enough that they can update the Tag model immediately without too much contention, but if that assumption turns out to be wrong, then delete events should be added to ArticlesAndTags as well.
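The counting step of that periodic task reduces to summing tag frequencies over the newly seen associations and folding them into the running totals. A pure-Python sketch of just that reduction, with the datastore plumbing and model names left out:

```python
from collections import Counter

def roll_up(tag_counts, new_associations):
    """Fold the tag frequencies of newly seen (tag, timestamp) pairs
    into the running per-tag counts and return the updated mapping."""
    counts = Counter(tag_counts)
    counts.update(tag for tag, _timestamp in new_associations)
    return dict(counts)
```

In the real task you would also record the newest timestamp processed, so the next run only picks up associations added since.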
You don't seem to have very specific or complex requirements, so my opinion is that neither method is likely to show significant benefits; rather, the pros and cons will depend completely on what you're used to, how you want to structure your code, and how you implement caching and counting mechanisms.
The things that come to mind for me are:
- The ListProperty method leaves the data models looking more natural.
- The ArticlesAndTags method means you'd have to query for the relationships and then for the Articles (ugh..), instead of doing Article.all().filter('tags =', tag).
Related
I am coming from .NET Core and I am curious if Django has anything similar to .NET Core's projections. For instance, I can describe a relationship in a .NET Core model and then later query for it. So, if Articles can have an Author I can do something like:
var articles = dbContext.Articles.Where(a => a.ID == id).Include(a => a.Author);
What I would get back are articles that have their author attached.
Is there anything similar to this in Django? How can I load related data in Django that's described in the model?
Yes. The query you wrote is more or less equivalent to:
Article.objects.filter(id=some_id).prefetch_related('author')
or:
Article.objects.filter(id=some_id).select_related('author')
select_related versus prefetch_related
select_related performs the JOIN at the database level and is preferable in case the number of Authors is limited, or for more of a one-to-one relation. In case you pull a large number of Articles and several Articles map to the same Author, it is usually better to use prefetch_related: this will first collect the Author identifiers, perform a uniqueness filter, and then fetch those into memory.
The same holds if multiple Authors write one Article (a one-to-many or many-to-many relation). A JOIN at the database level would repeat every Article by the number of Authors who wrote it, and every Author by the number of Articles they wrote. We usually want to avoid this "multiplicative" behavior for such sets, and in that case prefetch_related has linear behavior: first fetching the relevant Articles, next fetching the relevant Authors.
Lazy loading of related objects
But you actually do not need to perform a prefetch_related for a single instance. In case you load an article, you can simply use some_article.author. If the corresponding Author instance is not yet loaded, Django will perform an additional query fetching the related Author instance.
So Django can load attributes that correspond to related objects in a lazy manner: it simply loads the Article when you fetch it into memory, and if you later need the Author, or the Journal, or the .editor of the Journal (which is, for example, an Author as well), Django will each time make a new query and load the related object(s). If, however, you want to process a list of Articles in batch, select_related and prefetch_related are advisable, since they result in a limited number of queries to fetch all related objects, instead of one query per related instance.
The lazy loading of related objects can be more efficient if one frequently has to fetch zero or at most a few related instances (for example because it depends on some attributes of the Article whether we are really interested in the Author after all).
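The lazy pattern itself fits in a few lines of plain Python. This is an illustration of the idea only, not Django's actual descriptor machinery; the class name and `get()` method are my own:

```python
class LazyRelated:
    """Defer a 'query' until the related object is first accessed, then cache it."""

    def __init__(self, loader):
        self._loader = loader  # callable that performs the expensive fetch
        self._value = None
        self._loaded = False

    def get(self):
        if not self._loaded:            # first access triggers the one extra query
            self._value = self._loader()
            self._loaded = True
        return self._value              # later accesses hit the cache
```

Accessing `some_article.author` repeatedly behaves like calling `get()` here: the underlying query runs once and the result is cached on the instance.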
Sounds like you are looking for select_related. This traverses FK relationships based on how they are created in your models.
I want to have a recommendation API for my app where a user can get a suggestion for an object they haven't yet seen, and I'm having trouble figuring out how I'm supposed to structure my data to make such a query efficient.
Here's an example using books. Suppose this is what I have as my models, the most normalized way possible:
class User(ndb.Model):
    name = ndb.StringProperty()

class Book(ndb.Model):
    title = ndb.StringProperty()

class Review(ndb.Model):
    user = ndb.KeyProperty(User)
    book = ndb.KeyProperty(Book)
    stars = ndb.IntegerProperty()
    text = ndb.TextProperty()
Now given a user, I want to retrieve a book that the user hasn't reviewed and this seems basically impossible to do efficiently and at scale (e.g. 50k users, 100k books).
I've read around and I realize that I should denormalize my data somehow, but for the life of me I can't figure out a good way to do it. I've thought of putting Review as a StructuredProperty inside Book, but I don't think that really buys me very much, and it means I'll be limited in the number of reviews I can add to a book (because of the size limit for entities).
The other things I've seen mentioned a lot when other people asked similar questions are ancestors and ComputedProperty, but I don't really see how they help me here either.
Surely it's not actually impossible, and I just have a weak understanding of the best practices, right?
A useful de-normalization might be to add to User the list of books they've reviewed:
class User(ndb.Model):
    name = ndb.StringProperty()
    seen = ndb.KeyProperty(kind='Book', repeated=True)
and to Book an overall "score" by which you'll want to order queries:
class Book(ndb.Model):
    title = ndb.StringProperty()
    score = ndb.IntegerProperty()
As usual, the cost of de-normalization comes when writing: in addition to creating a new Review, you'll also need to update the User and Book entities (and you may need a transaction, and thus entity groups, if e.g. several users may be reviewing a book at the same time; but I'm skipping that part :-).
The advantage is that, when needing to propose a new book to a given User, you can query for Book (keys-only, sorted by score), with a cursor (or just a loop on the query) to "page through" the query's results, and just reject keys that, as you can check in-memory, are already in the given User's seen property.
Upon getting the User entity for the purpose, you can turn the seen into a set, so the checks will be very fast. This assumes that a user won't review more than a few thousand books, so everything needed should nicely fit in memory...
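Once `seen` is a set, the paging loop is just "walk keys in score order, skip what's in the set". A sketch of that loop with plain keys standing in for datastore keys and cursors (the function name is my own):

```python
def recommend(scored_book_keys, seen_keys, limit=1):
    """Walk book keys (assumed already sorted by score, best first)
    and return up to `limit` keys the user hasn't seen."""
    seen = set(seen_keys)  # turn the 'seen' list into a set for O(1) checks
    out = []
    for key in scored_book_keys:
        if key not in seen:
            out.append(key)
            if len(out) == limit:
                break
    return out
```

With the real datastore, `scored_book_keys` would come from a keys-only query ordered by score, consumed via a cursor until enough unseen keys are found.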
Say I have RootEntity, AEntity(child of RootEntity), BEntity(child of AEntity).
class RootEntity(ndb.Model):
    rtp = ndb.StringProperty()

class AEntity(ndb.Model):
    ap = ndb.IntegerProperty()

class BEntity(ndb.Model):
    bp = ndb.StringProperty()
So in different handlers I need to get instances of BEntity with a specific ancestor (an instance of AEntity).
Here is my query:
a_key = (AEntity.query(ancestor=ndb.Key("RootEntity", 1))
                .filter(AEntity.ap == int(some_value))
                .get().key)
BEntity.query(ancestor=ndb.Key("RootEntity", 1, "AEntity", a_key.integer_id()))
How can I optimize this query? Can I make it better, maybe less convoluted?
Upd:
This query is part of a function with the @ndb.transactional decorator.
You should not use Entity Groups to represent entity relationships.
Entity groups have a special purpose: to define the scope of transactions. They give you the ability to update multiple entities transactionally, as long as they are part of the same entity group (this limitation has been somewhat relaxed with the new XG transactions). They also allow you to use queries within transactions (not available with XG transactions).
The downside of entity groups is that they have an update limitation of 1 write/second.
In your case my suggestion would be to use separate entities and make references between them. The reference should be a Key of the referenced entity as this is type-safe.
Regarding query simplicity: GAE unfortunately does not support JOINs or reference (multi-entity) queries, so you would still need to combine multiple queries together (as you do now).
There is a give and take with ancestor queries: they are more verbose and messier to deal with, but you get better structure for your data and consistency in your queries.
To simplify this, if your handler knows the BEntity you want to get, just pass around the key.urlsafe() encoded key, it already has all of your ancestor information encoded.
If this is not possible, try possibly restructuring your data. Since these objects are all of the same ancestor, they belong to the same entity group, thus at most you can insert/update ~1 time per second for objects in that entity group. If you require higher throughput or do not require consistent ancestral queries, then try using ndb.KeyProperty to link entities with a reference to a parent rather than as an ancestor. Then you'd only need to get a single parent to query on rather than the parent and the parent's parent.
You should also try and use IDs whenever possible, so you can avoid having to filter for entities in your datastore by properties and just reference them by ID:
BEntity.query(ancestor=ndb.Key("RootEntity", 1, "AEntity", int(some_value)))
Here, int(some_value) is the integer ID of the AEntity you used when you created that object. Just be sure that you can ensure the IDs you manually create/use will be unique across all instances of that Model that share the same parent.
EDIT:
To clarify: in my last example I was suggesting to restructure the data so that int(some_value) is used as the integer ID of the AEntity rather than stored as a separate property of the entity - if possible, of course. In the example given, a query is performed for the AEntity objects that have a given integer field value of int(some_value), and executed with a get() - implying that you always expect a single entity back for that integer value. That makes it a good candidate for the integer ID of that object's key, eliminating the need for a query.
In my app, I have the following process:
1. Get a very long list of people
2. Create an entity for each person
3. Send an email to each person (step 2 must be completed before step 3 starts)
Because the list of people is very large, I don't want to put them in the same entity group.
In doing step 3, I can query the list of people like this:
Person.all()
Because of eventual consistency, I might miss some people in step 3. What is a good way to ensure that I am not missing anyone in step 3?
Is there a better solution than this?:
while Person.all().count(limit=N) < N:  # note: count() caps at 1000 unless given a higher limit
    pass
for p in Person.all():
    # do whatever
EDIT:
Another possible solution came to mind: I could create a linked list of the people. I can store a link to the first one, who links to the second one, and so on. It seems the performance would be poor, however, because each get would be done separately, without the efficiencies of a query.
UPDATE: I reread your post and saw that you don't want to put them all in the same entity group. I'm not sure how to guarantee strong consistency without doing so. You might want to restructure your data so that you don't have to put them in the same entity group, but in several. Perhaps depending on some aspect of a group of Person entities? (e.g., mailing list they are on, type of email being sent, etc.) Does each Person only contain a name and an email address, or are there other properties involved?
Google suggests a few other alternatives:
If your application is likely to encounter heavier write usage, you may need to consider using other means: for example, you might put recent posts in a memcache with an expiration and display a mix of recent posts from the memcache and the Datastore, or you might cache them in a cookie, put some state in the URL, or something else entirely. The goal is to find a caching solution that provides the data for the current user for the period of time in which the user is posting to your application. Remember, if you do a get, a put, or any operation within a transaction, you will always see the most recently written data.
So it looks like you may want to investigate those possibilities, although I'm not sure how well they would translate to what your app needs.
ORIGINAL POST: Use ancestor queries.
From Google's "Structuring Data for Strong Consistency":
To obtain strongly consistent query results, you need to use an ancestor query limiting the results to a single entity group. This works because entity groups are a unit of consistency as well as transactionality. All data operations are applied to the entire group; an ancestor query won't return its results until the entire entity group is up to date. If your application relies on strongly consistent results for certain queries, you may need to take this into consideration when designing your data model. This page discusses best practices for structuring your data to support strong consistency.
So when you create a Person entity, set a parent for it. I believe you could even just have a specific entity be the "parent" of all the others, and it should give you strong consistency. (Although I like to structure my data a bit with ancestors anyway.)
# Gives you the ancestor key
def ancestor_key(kind, id_or_name):
    return db.Key.from_path(kind, id_or_name)

# 'kind' is the db model you're using (should be 'Person' in this case) and
# 'id_or_name' should be the key id or name of the parent
new_person = Person(your_params, parent=ancestor_key('Kind', id_or_name))
You could even do queries at that point for all the entities with the same parent, which is nice. But that should help you get more consistent results regardless.
I have a lot of (e.g.) posts, marked with one or more tags. Posts can be created or deleted, and users can also search for one or more tags (combined with a logical AND).
First idea that came to my mind was a simple model
class Post(db.Model):
    # blahblah
    tags = db.StringListProperty()
The implementation of the create and delete operations is obvious. Search is more complex: to search for N tags it would do N GQL queries like "SELECT * FROM Post WHERE tags = :1" and merge the results using cursors, which has terrible performance.
Second idea is to separate tags in different entities
class Post(db.Model):
    # blahblah
    tags = db.ListProperty(db.Key)  # for fast access

class Tag(db.Model):
    name = db.StringProperty(name="key")
    posts = db.ListProperty(db.Key)  # list of posts marked with this tag
This fetches Tags from the datastore by key (much faster than fetching them with GQL) and merges them in memory. I think this implementation has better performance than the first one, but very frequently used tags can exceed the maximum size allowed for a single datastore entity. And there is another problem: the datastore can modify a single entity only about once per second, so for frequently used tags we also have a bottleneck in modification latency.
Any suggestions?
To further Nick's questioning: if it is a logical AND using multiple tags in the query, use tags = tag1 AND tags = tag2 ... Set membership in a single query is one of the datastore's shining features. You can achieve your result in one query.
http://code.google.com/appengine/docs/python/datastore/queriesandindexes.html#Properties_With_Multiple_Values
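The set-membership behavior can be sketched in plain Python, with dicts standing in for entities (hypothetical data, not the datastore API): chaining equality filters on a multi-valued property keeps only entities whose list contains every value.

```python
def filter_by_tags(posts, required_tags):
    """Each chained 'tags = x' filter keeps posts whose tag list contains x,
    so the result is posts containing ALL required tags (logical AND)."""
    required = set(required_tags)
    return [p for p in posts if required <= set(p["tags"])]
```

In GQL this is the single query SELECT * FROM Post WHERE tags = :1 AND tags = :2, with no merging in application code.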
Probably a possible solution is to take your second example and modify it in a way that permits efficient queries on larger sets. One way that springs to mind is to use multiple datastore entities for a single tag, grouped in such a way that you would seldom need to fetch more than a few groups. If the default sort order (well, let's just call it the only permitted one) is by post date, then fill the tag-group entities in that order.
class Tag(db.Model):
    name = db.StringProperty(name="key")
    posts = db.ListProperty(db.Key)  # list of posts marked with this tag
    firstpost = db.DateTimeProperty()
When adding or removing posts in a group, check how many posts are in that group: if the post you are adding would give the group more than, say, 100 posts, split it into two tag groups. If you are removing a post so that the group would have fewer than 50 posts, steal some posts from the previous or next group; if one of the adjacent groups also has only 50 posts, just merge them together. When listing posts by tag (in post-date order), you need only fetch a handful of groups.
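The split half of that bookkeeping can be sketched as plain list manipulation. The 100-post limit is the one suggested above; everything else (function name, list-of-lists representation) is my own, and the merge-on-removal side is omitted:

```python
MAX_GROUP = 100  # suggested cap on posts per tag-group entity

def add_post(groups, post_key):
    """Append a post key to the last tag group, splitting the group
    in half once it grows past MAX_GROUP."""
    if not groups:
        groups.append([])
    groups[-1].append(post_key)
    if len(groups[-1]) > MAX_GROUP:
        full = groups.pop()
        mid = len(full) // 2
        groups.extend([full[:mid], full[mid:]])
    return groups
```

With real entities, each inner list would be one Tag group entity, and the split would create a new entity carrying the second half plus its own firstpost value.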
That doesn't really resolve the high-demand tag problem.
Thinking about it, it might be okay for inserts to be a bit more speculative. Get the latest tag group entries, merge them and place a new tag group. The lag in the transactions might actually not be a real problem.