I'm trying to build some kind of KPI into my site, and I'm struggling with how to retrieve the data.
For example, let's say I'm building a blog with a model of:

class MyPost(ndb.Model):
    author = ndb.KeyProperty(MyUser, required=True)
    when = TZDateTimeProperty(required=True)
    status = ndb.IntegerProperty(default=1)  # 1=draft, 2=published
    text = ndb.TextProperty()
and I want to build a query that would list my top authors, giving me a result like (preferably sorted):

{'Jack': 10, 'Jane': 8, 'Joe': 0}
I can think of 2 ways:

1. query().fetch() all items and manually count them -- very inefficient, but the most flexible.
2. for author in users: result[author] = query(...).count() -- so-so efficiency, and it requires knowing my indexes in advance (it would not work if I wanted to query by "author's favorite pet").
Which one is preferable?
What other methods would you recommend?
I'd recommend de-normalizing the MyUser model, that is, introducing redundancy, by giving MyUser an IntegerProperty, say numposts, that redundantly keeps track of how many MyPost entities that user has authored. The need to de-normalize is frequent in NoSQL data stores.
The price you pay for this modest de-normalization is that adding a new post requires more work, since you also need to increment the author's numposts when that happens. However, more often than not, a data store is "read mostly" -- addition of new entities is comparatively rare compared to querying existing ones. The purpose of de-normalization is to make the latter activity vastly more efficient, for important queries, at a modest cost to the former activity.
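For instance, a minimal sketch (assuming the ndb models above; the add_post helper and the name field are illustrative):

from google.appengine.ext import ndb

class MyUser(ndb.Model):
    name = ndb.StringProperty()
    numposts = ndb.IntegerProperty(default=0)  # de-normalized count of authored posts

@ndb.transactional(xg=True)  # cross-group: the post and its author are separate entity groups
def add_post(author_key, when, text):
    MyPost(author=author_key, when=when, text=text).put()
    author = author_key.get()
    author.numposts += 1  # keep the redundant counter in sync
    author.put()

The "top authors" question then becomes a single indexed, sorted query: MyUser.query().order(-MyUser.numposts).fetch(10).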
Related
I want to have a recommendation API for my app where a user can get a suggestion for an object they haven't yet seen, and I'm having trouble figuring out how I'm supposed to structure my data to make such a query efficient.
Here's an example using books. Suppose this is what I have as my models, the most normalized way possible:
class User(ndb.Model):
    name = ndb.StringProperty()

class Book(ndb.Model):
    title = ndb.StringProperty()

class Review(ndb.Model):
    user = ndb.KeyProperty(User)
    book = ndb.KeyProperty(Book)
    stars = ndb.IntegerProperty()
    text = ndb.TextProperty()
Now given a user, I want to retrieve a book that the user hasn't reviewed and this seems basically impossible to do efficiently and at scale (e.g. 50k users, 100k books).
I've read around and I realize that I should denormalize my data somehow, but for the life of me I can't figure out a good way to do it. I've thought of putting Review as a StructuredProperty inside Book, but I don't think that really buys me very much, and it means I'll be limited in the number of reviews I can add to a book (because of the entity size limit).
The other things I've seen mentioned a lot when other people asked similar questions are ancestors and ComputedProperty, but I don't really see how they help me here either.
Surely it's not actually impossible, and I just have a weak understanding of the best practices, right?
A useful de-normalization might be to add to User the list of books they've reviewed:
class User(ndb.Model):
    name = ndb.StringProperty()
    seen = ndb.KeyProperty('Book', repeated=True)
and to Book an overall "score" by which you'll want to order queries:
class Book(ndb.Model):
    title = ndb.StringProperty()
    score = ndb.IntegerProperty()
As usual, the cost of de-normalization comes when writing -- in addition to creating a new review, you'll also need to update the User and Book entities (and you may need a transaction, thus entity groups, if e.g. several users may be reviewing a book at the same time -- but I'm skipping that part :-).
The advantage is that, when needing to propose a new book to a given User, you can query for Book (keys-only, sorted by score), with a cursor (or just a loop on the query) to "page through" the query's results, and just reject keys that, as you can check in-memory, are already in the given User's seen property.
Upon getting the User entity for the purpose, you can turn the seen into a set, so the checks will be very fast. This assumes that a user won't review more than a few thousand books, so everything needed should nicely fit in memory...
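A minimal sketch of that loop, using the models above (the propose_book helper itself is illustrative):

def propose_book(user):
    # Return the key of the best-scored Book the user hasn't reviewed, or None.
    seen = set(user.seen)  # a set gives fast in-memory membership tests
    for book_key in Book.query().order(-Book.score).iter(keys_only=True):
        if book_key not in seen:
            return book_key
    return None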
Related
In my app, I have the following process:
1. Get a very long list of people
2. Create an entity for each person
3. Send an email to each person (step 2 must be completed before step 3 starts)
Because the list of people is very large, I don't want to put them in the same entity group.
In doing step 3, I can query the list of people like this:
Person.all()
Because of eventual consistency, I might miss some people in step 3. What is a good way to ensure that I am not missing anyone in step 3?
Is there a better solution than this?:
while Person.all().count() < N:
    pass

for p in Person.all():
    # do whatever
EDIT:
Another possible solution came to mind. I could create a linked list of the people: I can store a link to the first one, he can link to the second one, and so on. It seems the performance would be poor, however, because each get would be done separately and you wouldn't have the efficiencies of a query.
UPDATE: I reread your post and saw that you don't want to put them all in the same entity group. I'm not sure how to guarantee strong consistency without doing so. You might want to restructure your data so that you don't have to put them in the same entity group, but in several. Perhaps depending on some aspect of a group of Person entities? (e.g., mailing list they are on, type of email being sent, etc.) Does each Person only contain a name and an email address, or are there other properties involved?
Google suggests a few other alternatives:
If your application is likely to encounter heavier write usage, you may need to consider using other means: for example, you might put recent posts in a memcache with an expiration and display a mix of recent posts from the memcache and the Datastore, or you might cache them in a cookie, put some state in the URL, or something else entirely. The goal is to find a caching solution that provides the data for the current user for the period of time in which the user is posting to your application. Remember, if you do a get, a put, or any operation within a transaction, you will always see the most recently written data.
So it looks like you may want to investigate those possibilities, although I'm not sure how well they would translate to what your app needs.
ORIGINAL POST: Use ancestor queries.
From Google's "Structuring Data for Strong Consistency":
To obtain strongly consistent query results, you need to use an ancestor query limiting the results to a single entity group. This works because entity groups are a unit of consistency as well as transactionality. All data operations are applied to the entire group; an ancestor query won't return its results until the entire entity group is up to date. If your application relies on strongly consistent results for certain queries, you may need to take this into consideration when designing your data model. This page discusses best practices for structuring your data to support strong consistency.
So when you create a Person entity, set a parent for it. I believe you could even just have a specific entity be the "parent" of all the others, and it should give you strong consistency. (Although I like to structure my data a bit with ancestors anyway.)
# Gives you the ancestor key
def ancestor_key(kind, id_or_name):
    return db.Key.from_path(kind, id_or_name)

# `kind` is the db model you're using (should be 'Person' in this case) and
# `id_or_name` should be the key id or name of the parent
new_person = Person(your_params, parent=ancestor_key('Kind', id_or_name))
You could even do queries at that point for all the entities with the same parent, which is nice. But that should help you get more consistent results regardless.
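For instance, a quick sketch (the parent id 'everyone' is just an illustrative placeholder):

# An ancestor query is strongly consistent: its results are limited to one entity group
parent = ancestor_key('Person', 'everyone')
people = Person.all().ancestor(parent).fetch(1000)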
Related
In my application, we need to develop a FRIENDS relationship table in the datastore. And of course a quick solution I've thought of would be this:

class Friends(db.Model):
    user = db.ReferenceProperty(User, required=True, collection_name='user')
    friend = db.ReferenceProperty(User, required=True, collection_name='friends')
But what would happen when the friend list grows to a huge number, say a few thousand or more? Will this be too inefficient?
Performance is always a priority for us. This is very much needed, as we will have a few more relationships following this same design.
Please give advice on the best approach to designing a FRIENDS relationship table using the datastore within the App Engine Python environment.
EDIT
Other than the FRIENDS relationship, a FOLLOWER relationship will be created as well. And I believe all these relationships will be queried very often, given how social-media oriented my application tends to be.
For example, if I follow some users, I will get updates as a news feed on what they are doing, etc., and the activity will increase over time. As for how many users, I can't answer yet as we haven't gone live, but I foresee millions of users as we go on.
Hopefully this helps you give more specific advice -- or is there an alternative to this approach?
Your FRIENDS model (and presumably also your FOLLOWERS model) should scale well. The tricky part in your system is actually aggregating the content from all of a user's friends and followees.
Querying for the list of a user's friends is O(N), where N is the number of friends, given the table you've described in your post. However, each of those queries requires another O(N) operation to retrieve the content each friend has shared. This leads to O(N^2) work each time a user wants to see recent content. This particular query is bad for two reasons:
An O(N^2) operation isn't what you want to see in your core algorithm when designing a system for millions of users.
App Engine tends to limit these kinds of queries. Specifically, the IN keyword you'd need to use in order to grab the list of shared items won't work for more than 30 friends.
For this particular problem, I'd recommend creating another table that links each user to each piece of shared content. Something like this:
class SharedItems(db.Model):
    # `from` is a reserved word in Python, hence `sharer`; two references to the
    # same model also need distinct collection_name values
    user = db.ReferenceProperty(User, required=True,
                                collection_name='shared_with_me')  # logged-in user
    sharer = db.ReferenceProperty(User, required=True,
                                  collection_name='shared_by_me')  # who shared it
    item = db.ReferenceProperty(Item, required=True)  # the item itself
    posted = db.DateTimeProperty()  # when it was shared
When it comes time to render the stream of updates, you need an O(N) query (N is the number of items you want to display) to look up all the items shared with the user (ordered by date descending). Keep N small to keep this as fast as possible.
Sharing an item requires creating O(N) SharedItems where N is the number of friends and followers the poster has. If this number is too large to handle in a single request, shard it out to a task queue or backend.
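The read side might then look like this (a sketch; `me` stands for the logged-in User entity and 20 is an arbitrary page size):

# One indexed query pulls the user's stream, newest first
# (needs a composite index on user + posted descending)
stream = (SharedItems.all()
          .filter('user =', me)
          .order('-posted')
          .fetch(20))  # keep the page size small for speed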
Property lists are a great way to get cheap/simple indexing in GAE, but as you have correctly identified, there are a few limitations:

The index size of the entire entity is limited (I think currently to 5000 entries), and each property-list value requires an index entry, so in practice the list size must stay below 5000.
Serialization of such a large property list is expensive! Bringing back a 2 MB entity is slow and will cost CPU.

If you're expecting a large property list, don't do it. The alternative is to create a JOIN table that models the relationship:
class Friends(db.Model):
    # `from` is a reserved word in Python, hence `friend` (matching the queries below)
    user = db.ReferenceProperty(User, required=True, collection_name='friendships')  # logged-in user
    friend = db.ReferenceProperty(User, required=True, collection_name='friend_of')  # the friend
Just an entity with two keys.
This allows simple querying to find all friends for a user:
SELECT * FROM Friends WHERE user = :me
and to find all users I am the friend of:
SELECT * FROM Friends WHERE friend = :me
Since these queries return keys, you can do a bulk get(keylist) to fetch the actual friends' details.
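In db terms, that could look like this (a sketch; `me` is the logged-in User entity):

links = Friends.all().filter('user =', me).fetch(1000)
# Read the raw keys instead of dereferencing each ReferenceProperty,
# which would cost one datastore get per friend
friend_keys = [Friends.friend.get_value_for_datastore(l) for l in links]
friends = db.get(friend_keys)  # a single batch get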
Related
There are going to be "articles" and "tags" in my App Engine application.
And there are two techniques to implement that (thanks to Nick Johnson's article):
# one entity just refers to others
class Article(db.Model):
    tags = db.ListProperty(db.Key)  # keys of Tag entities

# via a separate "join" table
class ArticlesAndTags(db.Model):
    article = db.ReferenceProperty(Article)
    tag = db.ReferenceProperty(Tag)
Which one should I prefer according to the following tasks?
Create tag cloud (frequently),
Select articles by a tag (rather rarely)
Because of the lack of a 'reduce' feature in App Engine's MapReduce (and of an SQL GROUP BY-like query), tag clouds are tricky to implement efficiently, because you need to count all your tags manually. Whichever implementation you go with, what I would suggest for the tag cloud is a separate model, TagCounter, that keeps track of how many of each tag you have. Otherwise the tag query could get expensive if you have a lot of them.
class TagCounter(db.Model):
    tag = db.ReferenceProperty(Tag)
    counter = db.IntegerProperty(default=0)
Every time you choose to update your tags on an article, make sure you increment and decrement from this table accordingly.
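A sketch of that bookkeeping (illustrative; it assumes old_tags and new_tags are lists of Tag keys and skips transactions for brevity):

def update_tag_counters(old_tags, new_tags):
    for tag in set(new_tags) - set(old_tags):  # tags that were added
        counter = TagCounter.all().filter('tag =', tag).get() or TagCounter(tag=tag)
        counter.counter += 1
        counter.put()
    for tag in set(old_tags) - set(new_tags):  # tags that were removed
        counter = TagCounter.all().filter('tag =', tag).get()
        if counter:
            counter.counter -= 1
            counter.put()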
As for selecting articles by a tag, the first implementation is sufficient (the second is overly complex imo).
class Article(db.Model):
    tags = db.ListProperty(db.Key)  # keys of Tag entities

    @staticmethod
    def select_by_tag(tag_key):
        return Article.all().filter("tags", tag_key).run()
I have created a huge tag cloud* on GAEcupboard, opting for the first solution:
class Post(db.Model):
    title = db.StringProperty(required=True)
    tags = db.ListProperty(str, required=True)
The tag class has a counter property that is updated each time a new post is created/updated/deleted.
class Tag(db.Model):
    name = db.StringProperty(required=True)
    counter = db.IntegerProperty(required=True)
    last_modified = db.DateTimeProperty(required=True, auto_now=True)
Having the tags organized in a ListProperty, it's quite easy to offer a drill-down feature that allows users to combine different tags to search for the desired articles:
Example:
http://www.gaecupboard.com/tag/python/web-frameworks
The search is done using:
posts = Post.all()
posts.filter('tags', 'python').filter('tags', 'web-frameworks')
posts.fetch(20)  # db's fetch() requires an explicit limit
that does not need any custom index at all.
* OK, it's too huge, I know :)
Creating a tag cloud in App Engine is really difficult because the datastore doesn't support the GROUP BY construct normally used to express that, nor does it supply a way to order by the length of a list property.
One of the key insights is that you have to show a tag cloud frequently, but you only have to create one when there are new articles or articles get retagged, since you'll get the same tag cloud in any case. In fact, the tag cloud doesn't change very much with each new article; maybe a tag in the cloud becomes a little larger or a little smaller, but not by much, and not in a way that would affect its usefulness.
This suggests that tag clouds should be created periodically, cached, and displayed much like static content. You should think about doing that with the Task Queue API.
The other query, listing articles by tag, would be utterly unsupported by the first technique you've shown. Inverting it, having a Tag model with an articles ListProperty, does support the query, but will suffer from update contention when popular tags have to be added to it at a high rate. The other technique, using an association model, suffers from neither of these concerns, but makes the article-listing queries harder to express conveniently.
The way I would deal with this is to start with the ArticlesAndTags model, but add some additional data to the model to have a useful ordering; an article date, article title, whatever makes sense for the particular kind of site you're making. You'll also need a monotonic sequence (say, a timestamp) on this so you know when the tag applied.
The tag cloud query would be supported using a Tag entity that has only a numeric article count, plus a reference to the same timestamp used in the ArticlesAndTags model.
A task queue can then query for the 1000 oldest ArticlesAndTags that are newer than the oldest Tag, sum the frequencies of each, and add them to the counts in the Tags. Tag removals are probably rare enough that they can update the Tag model immediately without too much contention, but if that assumption turns out to be wrong, then delete events should be added to ArticlesAndTags as well.
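A rough sketch of that task (illustrative; it assumes ArticlesAndTags gained an added timestamp and Tag gained count and last_seen properties, and it glosses over batch-boundary edge cases):

def update_tag_counts(watermark):
    # watermark: the `added` timestamp up to which counts are already folded in
    batch = (ArticlesAndTags.all()
             .filter('added >', watermark)
             .order('added')
             .fetch(1000))
    counts = {}
    for at in batch:
        tag_key = ArticlesAndTags.tag.get_value_for_datastore(at)
        counts[tag_key] = counts.get(tag_key, 0) + 1
    for tag_key, n in counts.items():
        tag = db.get(tag_key)
        tag.count += n
        tag.last_seen = batch[-1].added  # the new watermark
        tag.put()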
You don't seem to have very specific/complex requirements so my opinion is it's likely neither method would show significant benefits, or rather, the pros/cons will depend completely on what you're used to, how you want to structure your code, and how you implement caching and counting mechanisms.
The things that come to mind for me are:
- The ListProperty method leaves the data models looking more natural.
- The ArticlesAndTags method means you'd have to query for the relationships and then for the Articles (ugh..), instead of doing Article.all().filter('tags =', tag).
Related
I have a quite common design problem: I need to implement a history log (audit trail) for records in Google App Engine. The history log has to be structured, i.e. I cannot join all the changes into some free-form text and store it in a string field.
I've considered the following options for the history model and, after noticing performance issues with option 1, I've chosen to implement option 3. But I still have some doubts about whether this solution is efficient and scalable. For instance: is there a risk that performance will degrade significantly as the number of dynamic properties grows in option 3?
Do you have some deeper knowledge of the pros/cons of each option, or could you suggest other audit-trail design patterns suited to Google App Engine's datastore characteristics?
Option 1: Use a classic SQL "master-detail" relation
Pros
simple to understand for database developers with SQL background
clean: direct definition for history record and its properties
search performance: easy searching through history (can use indices)
troubleshooting: easy access by administration tools (_ah/admin)
Cons
one-to-many relations are often not recommended to be implemented this way in GAE DB
read performance: an excessive number of read operations to show a long audit trail, e.g. in the details pane of a big list of records
Option 2: Store history in a BLOB field (pickled Python structures)
Pros
simple to implement and flexible
read performance: very efficient
Cons
query performance: cannot search using indices
troubleshooting: cannot inspect data by admin db viewer (_ah/admin)
unclean: not so easy to understand/accept for SQL developers (they consider this ugly)
Option 3: Store history in an Expando's dynamic properties, e.g. for each field fieldName create history_fieldName_n fields (where n = 0..N is the history record number)
Pros:
simple: simple to implement and understand
troubleshooting: can read all the history properties through admin interface
read performance: one read operation to get the record
Cons:
search performance: cannot simply search through history records (they have different names)
not too clean: number of properties may be confusing at first look
Option 4: Store history in a set of list fields on the main record, e.g. for each fieldName create a fieldName_history list field
Pros:
clean: direct definition of history properties
simple: easy to understand for SQL developers
read performance: one read operation to get the record
Cons:
search performance: can search using indices only for records that ever had some value; cannot search for records having a combination of values at some particular point in time
troubleshooting: inspecting lists is difficult in admin db viewer
If I had to choose, I would go for option 1. The reads are as performant as (if not more performant than) the other options, and the other options only have speed advantages under specific circumstances (small or very large sets of changes). Option 1 also gives you lots of flexibility (with more ease), like purging history after x days or querying history across different model types. Make sure you create the history entities as children of the changed entity, in the same transaction, to guarantee consistency. You could end up with one of these:
class HistoryEventFieldLevel(db.Model):
    # parent: the changed entity (no need to define it as a property)
    date = db.DateTimeProperty()
    model = db.StringProperty()
    property = db.StringProperty()  # name of changed property
    # ext.db has no EnumProperty; StringProperty with choices gives the same effect
    action = db.StringProperty(choices=['insert', 'update', 'delete'])
    # PickleProperty is not built into ext.db; assumes a custom property
    # (e.g. the one in the aetycoon library)
    old = PickleProperty()  # old value for field, empty on insert
    new = PickleProperty()  # new value for field, empty on delete

class HistoryEventModelLevel(db.Model):
    # parent: the changed entity (no need to define it as a property)
    date = db.DateTimeProperty()
    model = db.StringProperty()
    action = db.StringProperty(choices=['insert', 'update', 'delete'])
    change = PickleProperty()  # dict: {field name: (old value, new value)}
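For the field-level variant, the write path might look like this (a sketch; update_with_history is an illustrative helper):

import datetime
from google.appengine.ext import db

@db.transactional
def update_with_history(entity, prop_name, new_value):
    old_value = getattr(entity, prop_name)
    setattr(entity, prop_name, new_value)
    entity.put()
    HistoryEventFieldLevel(
        parent=entity,  # child of the changed entity: same group, same transaction
        date=datetime.datetime.utcnow(),
        model=entity.kind(),
        property=prop_name,
        action='update',
        old=old_value,
        new=new_value,
    ).put()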