Structuring NDB models to make an "antijoin" query possible

I want to have a recommendation API for my app where a user can get a suggestion for an object they haven't yet seen, and I'm having trouble figuring out how I'm supposed to structure my data to make such a query efficient.
Here's an example using books. Suppose this is what I have as my models, the most normalized way possible:
class User(ndb.Model):
    name = ndb.StringProperty()

class Book(ndb.Model):
    title = ndb.StringProperty()

class Review(ndb.Model):
    user = ndb.KeyProperty(User)
    book = ndb.KeyProperty(Book)
    stars = ndb.IntegerProperty()
    text = ndb.TextProperty()
Now given a user, I want to retrieve a book that the user hasn't reviewed and this seems basically impossible to do efficiently and at scale (e.g. 50k users, 100k books).
I've read around and I realize that I should denormalize my data somehow, but for the life of me, I can't figure out a good way to do it. I've thought of putting Review as a StructuredProperty inside of Book, but I don't think that really buys me very much, and it means that I'll be limited in the number of reviews I can add to a book (because of the size limit for entities).
The other things I've seen mentioned a lot when other people asked similar questions are ancestors and ComputedProperty, but I don't really see how they help me here either.
Surely it's not actually impossible, and I just have a weak understanding of the best practices, right?

A useful de-normalization might be to add to User the list of books they've reviewed:
class User(ndb.Model):
    name = ndb.StringProperty()
    seen = ndb.KeyProperty('Book', repeated=True)
and to 'Book' an overall "score" whereby you'll want to order queries:
class Book(ndb.Model):
    title = ndb.StringProperty()
    score = ndb.IntegerProperty()
As usual, the cost of de-normalization comes when writing -- in addition to creating a new review you'll also need to update the User and Book entities (and you may need a transaction, thus entity groups, if, e.g., several users may be reviewing a book at the same time, but I'm skipping that part :-).
The advantage is that, when needing to propose a new book to a given User, you can query for Book (keys-only, sorted by score), with a cursor (or just a loop on the query) to "page through" the query's results, and just reject keys that, as you can check in-memory, are already in the given User's seen property.
Upon getting the User entity for the purpose, you can turn the seen into a set, so the checks will be very fast. This assumes that a user won't review more than a few thousand books, so everything needed should nicely fit in memory...
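The in-memory filtering described above can be sketched in plain Python. This is only an illustration under stated assumptions: the list `ranked` stands in for a keys-only Book query ordered by score, the strings stand in for ndb keys, and `recommend` is a hypothetical helper name, not part of NDB.

```python
from itertools import islice

def recommend(ranked_book_keys, seen_keys, limit=1):
    """Return up to `limit` keys from the ranked stream that are not in seen_keys."""
    seen = set(seen_keys)  # set membership checks are O(1) on average
    # Lazily skip anything the user has already reviewed.
    unseen = (key for key in ranked_book_keys if key not in seen)
    # islice stops pulling from the stream once enough results are found.
    return list(islice(unseen, limit))

# Stand-in for a keys-only Book query ordered by score, best first.
ranked = ["book-9", "book-4", "book-7", "book-1"]
# Stand-in for the User entity's `seen` property.
user_seen = ["book-9", "book-7"]

print(recommend(ranked, user_seen))
print(recommend(ranked, user_seen, limit=5))
```

Because the unseen keys are produced lazily, the loop stops reading the ranked stream as soon as it has enough suggestions, which is the point of paging through the query rather than fetching every Book.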

Related

how to query by sum in ndb

I'm trying to build some kind of KPI into my site, and struggling with how to retrieve the data.
for example, let's say I'm building a blog with a model of :
class MyPost(ndb.Model):
    Author = ndb.KeyProperty(MyUser, required = True)
    when = TZDateTimeProperty(required = True)
    status = ndb.IntegerProperty(default = 1) # 1=draft, 2=published
    text = ndb.TextProperty()
and I want to build a query that would list my top authors that would give me a result of (preferably sorted)
[('Jack', 10), ('Jane', 8), ('Joe', 0)]
I can think of 2 ways:
1. query().fetch() all items and count them manually.
   This is very inefficient but the most flexible.
2. for author in Users: result[author] = query(...).count()
   So-so efficiency, and it requires knowing my indexes in advance (it would not work if I wanted to query by "author's favorite pet").
Which one is preferable?
What other methods would you recommend?

I'd recommend de-normalizing the MyUser model, that is, introducing redundancy, by giving MyUser an IntegerProperty, say numposts, that redundantly keeps track of how many MyPost entities that user has authored. The need to de-normalize is frequent in NoSQL data stores.
The price you pay for this modest de-normalization is that adding a new post requires more work, since you also need to increment the author's numposts when that happens. However, more often than not, a data store is "read mostly" -- addition of new entities is comparatively rare compared to querying existing ones. The purpose of de-normalization is to make the latter activity vastly more efficient, for important queries, at a modest cost to the former activity.
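As a rough sketch of that trade-off, the plain-Python stand-in below keeps a redundant counter on each author and updates it on every write. The `Author` class, the `posts` list, and the helper names are illustrative assumptions, not NDB code; with a real datastore the two writes would need to happen together, for example in a transaction.

```python
from dataclasses import dataclass

@dataclass
class Author:
    name: str
    numposts: int = 0  # redundant counter, maintained on every write

posts = []  # stand-in for the MyPost kind

def add_post(author, text):
    """Write path: store the post and bump the author's counter together."""
    posts.append({"author": author.name, "text": text})
    author.numposts += 1

def top_authors(authors):
    """Read path: sort on the stored counter; no posts are scanned."""
    pairs = [(a.name, a.numposts) for a in authors]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

jack, jane, joe = Author("Jack"), Author("Jane"), Author("Joe")
for _ in range(3):
    add_post(jack, "...")
add_post(jane, "...")

print(top_authors([jack, jane, joe]))
```

The read path never touches `posts`, which is what makes the "top authors" listing cheap once the counter exists.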

FRIENDS Table Datastore Design + App Engine Python

In my application, we need to develop a FRIENDS relationship table in datastore. And of course a quick solution I've thought of would be this:
user = db.ReferenceProperty(User, required=True, collection_name='user')
friend = db.ReferenceProperty(User, required=True, collection_name='friends')
But what would happen when the friend list grows to a huge number, say a few thousand or more? Will this be too inefficient?
Performance is always a priority for us. This matters a lot, as we will have a few more relationships that follow this same design.
Please advise on the best approach to designing a FRIENDS relationship table using the datastore within the App Engine Python environment.
EDIT
Other than the FRIENDS relationship, a FOLLOWER relationship will be created as well. I believe these relationships will be queried very often, given how social-media oriented my application is.
For example, if I follow some users, I will get updates in a news feed on what they are doing, etc. The activities will increase over time. As for how many users, I can't answer yet as we haven't gone live. But I foresee having millions of users as we go on.
Hopefully this helps in giving more specific advice. Or is there an alternative to this approach?

Your FRIENDS model (and presumably also your FOLLOWERS model) should scale well. The tricky part in your system is actually aggregating the content from all of a user's friends and followees.
Querying for a list of a user's friends is O(N), where N is the number of friends, due to the table you've described in your post. However, each of those results requires another O(N) operation to retrieve content that the friend has shared. This leads to O(N^2) each time a user wants to see recent content. This particular query is bad for two reasons:
An O(N^2) operation isn't what you want to see in your core algorithm when designing a system for millions of users.
App Engine tends to limit these kinds of queries. Specifically, the IN keyword you'd need to use in order to grab the list of shared items won't work for more than 30 friends.
For this particular problem, I'd recommend creating another table that links each user to each piece of shared content. Something like this:
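class SharedItems(db.Model):
    # Two references to the same model need distinct collection_name values.
    user = db.ReferenceProperty(User, required=True, collection_name='incoming_items') # logged-in user
    # "from" is a reserved word in Python, so the property is named sender.
    sender = db.ReferenceProperty(User, required=True, collection_name='outgoing_items') # who shared it
    item = db.ReferenceProperty(Item, required=True) # the item itself
    posted = db.DateTimeProperty() # when it was shared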
When it comes time to render the stream of updates, you need an O(N) query (N is the number of items you want to display) to look up all the items shared with the user (ordered by date descending). Keep N small to keep this as fast as possible.
Sharing an item requires creating O(N) SharedItems where N is the number of friends and followers the poster has. If this number is too large to handle in a single request, shard it out to a task queue or backend.
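The write-time fan-out described above can be illustrated without the datastore. In this sketch the `followers` dict and the `streams` dict are in-memory stand-ins for the FRIENDS/FOLLOWERS data and the SharedItems kind, and `share` and `recent` are hypothetical helper names.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# poster -> users who should receive the poster's items (friends and followers)
followers = {"alice": ["bob", "carol"], "bob": ["carol"]}
# recipient -> list of (posted, sender, item) rows, one per shared item
streams = defaultdict(list)

def share(sender, item, posted):
    """Write path: create one row per recipient, O(number of followers)."""
    for recipient in followers.get(sender, []):
        streams[recipient].append((posted, sender, item))

def recent(user, n):
    """Read path: only this user's own rows are examined, newest first."""
    return sorted(streams[user], reverse=True)[:n]

t0 = datetime(2020, 1, 1, 12, 0)
share("alice", "photo", t0)
share("bob", "link", t0 + timedelta(minutes=1))

print(recent("carol", 1))
print(recent("bob", 10))
```

The cost moves from the reader to the poster: rendering a stream reads only rows addressed to that user, while sharing writes one row per follower, which is the part that may need a task queue when the follower count is large.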

A ListProperty is a great way to get cheap, simple indexing in GAE, but as you have correctly identified, there are a few limitations:
The number of index entries for an entire entity is limited (I think currently 5000). Each ListProperty value requires an index entry, so basically the list size must stay below 4999.
Serialisation of such a large ListProperty is expensive!
Bringing back a 2 MB entity is slow, and it will cost CPU.
If you expect a large list, then don't do it.
The alternative is to create a JOIN table that models the relationship:
class Friends(db.Model):
    # Two references to the same model need distinct collection_name values.
    user = db.ReferenceProperty(User, required=True, collection_name='friend_links') # logged-in user
    friend = db.ReferenceProperty(User, required=True, collection_name='friend_of_links') # the user's friend
It is just an entity with 2 keys.
This allows for simple querying to find all friends for a user:
SELECT * FROM Friends WHERE user = :me
To find all users where I am the friend:
SELECT * FROM Friends WHERE friend = :me
Since each result holds a key, you can do a bulk get(keylist) to fetch the actual friends' details.

Choose between many-to-many techniques in App Engine DB

There is going to be "articles" and "tags" in my App Engine application.
And there are two techniques to implement that (thanks to Nick Johnson's article):
# one entity just refers others
class Article(db.Model):
    tags = db.ListProperty(Tag)

# via separate "join" table
class ArticlesAndTags(db.Model):
    article = db.ReferenceProperty(Article)
    tag = db.ReferenceProperty(Tag)
Which one should I prefer according to the following tasks?
Create tag cloud (frequently),
Select articles by a tag (rather rarely)

Because App Engine's MapReduce lacks a 'reduce' feature and the datastore has no SQL-style GROUP BY query, tag clouds are tricky to implement efficiently: you need to count all your tags manually. Whichever implementation you go with, I would suggest having a separate TagCounter model for the tag cloud that keeps track of how many times each tag is used. Otherwise the tag query could get expensive if you have a lot of them.
class TagCounter(db.Model):
    tag = db.ReferenceProperty(Tag)
    counter = db.IntegerProperty(default=0)
Every time you choose to update your tags on an article, make sure you increment and decrement from this table accordingly.
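A minimal sketch of that bookkeeping, using a `collections.Counter` as an in-memory stand-in for the TagCounter entities; `retag` is a hypothetical helper name, and a real implementation would update datastore entities instead.

```python
from collections import Counter

tag_counts = Counter()  # stand-in for the TagCounter entities

def retag(old_tags, new_tags):
    """Adjust the counters when an article's tags change from old to new."""
    old, new = set(old_tags), set(new_tags)
    for tag in new - old:  # tags that were added
        tag_counts[tag] += 1
    for tag in old - new:  # tags that were removed
        tag_counts[tag] -= 1
        if tag_counts[tag] <= 0:
            del tag_counts[tag]  # drop tags that are no longer used

retag([], ["python", "gae"])                       # new article
retag([], ["python"])                              # another new article
retag(["python", "gae"], ["python", "datastore"])  # first article is retagged

print(tag_counts.most_common())
```

Only the difference between the old and new tag sets is touched, so an edit that leaves a tag in place does not write to that tag's counter.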
As for selecting articles by a tag, the first implementation is sufficient (the second is overly complex imo).
class Article(db.Model):
    tags = db.ListProperty(Tag)

    @staticmethod
    def select_by_tag(tag):
        return Article.all().filter("tags", tag).run()

I have created a huge tag cloud * on GAEcupboard opting for the first solution:
class Post(db.Model):
    title = db.StringProperty(required = True)
    tags = db.ListProperty(str, required = True)
The tag class has a counter property that is updated each time a new post is created/updated/deleted.
class Tag(db.Model):
    name = db.StringProperty(required = True)
    counter = db.IntegerProperty(required = True)
    last_modified = db.DateTimeProperty(required = True, auto_now = True)
With the tags organized in a ListProperty, it's quite easy to offer a drill-down feature that lets users combine different tags to search for the desired articles:
Example:
http://www.gaecupboard.com/tag/python/web-frameworks
The search is done using:
posts = Post.all()
posts.filter('tags', 'python').filter('tags', 'web-frameworks')
results = posts.fetch(20)  # fetch() takes a limit; 20 is an arbitrary page size
This does not need any custom index at all.
* OK, it's too huge, I know :)

Creating a tag-cloud in app-engine is really difficult because the datastore doesn't support the GROUP BY construct normally used to express that; Nor does it supply a way to order by the length of a list property.
One of the key insights is that you have to show a tag cloud frequently, but you don't have to create one except when there are new articles, or articles get retagged, since you'll get the same tag cloud in any case. In fact, the tag cloud doesn't change very much for each new article: maybe a tag in the cloud becomes a little larger or a little smaller, but not by much, and not in a way that would affect its usefulness.
This suggests that tag-clouds should be created periodically, cached, and displayed much like static content. You should think about doing that in the Task Queue API.
The other query, listing articles by tag, would be utterly unsupported by the first technique you've shown. Inverting it, having a Tag model with an articles ListProperty, does support the query, but will suffer from update contention when popular tags have to get added to it at a high rate. The other technique, using an association model, suffers from neither of these concerns, but makes it harder to make the article listing queries convenient.
The way I would deal with this is to start with the ArticlesAndTags model, but add some additional data to the model to have a useful ordering; an article date, article title, whatever makes sense for the particular kind of site you're making. You'll also need a monotonic sequence (say, a timestamp) on this so you know when the tag applied.
The tag cloud query would be supported using a Tag entity that has only a numeric article count, and also a reference to the same timestamp used in the ArticlesAndTags model.
A task queue can then query for the 1000 oldest ArticlesAndTags that are newer than oldest Tag, sum the frequencies of each and add it to the counts in the Tags. Tag removals are probably rare enough that they can update the Tag model immediately without too much contention, but if that assumption turns out to be wrong, then delete events should be added to the ArticlesAndTags as well.
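The batch step can be sketched as follows, under simplifying assumptions: a single global watermark is used instead of one timestamp per Tag, the `events` list stands in for ArticlesAndTags rows as (timestamp, tag) pairs sorted by timestamp, and `apply_batch` is a hypothetical name for the task body. Timestamps are assumed to be strictly increasing, otherwise rows sharing the watermark's timestamp could be skipped.

```python
from collections import Counter

def apply_batch(events, counts, watermark, batch_size=1000):
    """Fold the oldest unprocessed events into counts; return the new watermark."""
    # Oldest `batch_size` events that are newer than the watermark.
    batch = [event for event in events if event[0] > watermark][:batch_size]
    for _, tag in batch:
        counts[tag] += 1
    # Advance to the last processed timestamp, or stay put if nothing was new.
    return batch[-1][0] if batch else watermark

events = [(1, "python"), (2, "gae"), (3, "python"), (4, "datastore")]
counts = Counter()

watermark = apply_batch(events, counts, watermark=0, batch_size=3)
print(watermark, dict(counts))
watermark = apply_batch(events, counts, watermark, batch_size=3)
print(watermark, dict(counts))
```

Each run does a bounded amount of work, so the task can be rescheduled until the watermark stops moving.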

You don't seem to have very specific/complex requirements so my opinion is it's likely neither method would show significant benefits, or rather, the pros/cons will depend completely on what you're used to, how you want to structure your code, and how you implement caching and counting mechanisms.
The things that come to mind for me are:
- The ListProperty method leaves the data models looking more natural.
- The ArticlesAndTags method will mean you'd have to query for the relationships and then the Articles (ugh..), instead of doing Article.all().filter('tags =', tag).

Database Design Inquiry

I'm making a trivia webapp that will feature both standalone questions, and 5+ question quizzes. I'm looking for suggestions for designing this model.
Should a quiz and its questions be stored in separate tables/objects, with a key to tie them together, or am I better off creating the quiz as a standalone entity, with lists stored for each of a question's characteristics? Or perhaps someone has another idea...
Thank you in advance. It would probably help to say that I am using Google App Engine, which typically frowns upon relational db models, but I'm willing to go my own route if it makes sense.

It's hard to say without more information, but having the following relations would be sensible, based on what you've said:
Quiz (id, title)
Question (id, question, answer)
QuizQuestion (quiz_id, question_id)
That way questions can appear in multiple quizzes.

My first cut (I assumed the questions were multiple choice):
I'd have a table of Questions, with ID_Question as the PK, the question text, and a category (if you want).
I'd have a table of Answers, with ID_Answer as the PK, QuestionID as a FK back to the Questions table, the answer text, and a flag as to whether it's the correct answer or not.
I'd have a table of Quizzes, with ID_Quiz as the PK, and a description of the quiz, and a category (if you want).
I'd have a table of QuizQuestions, with ID_QuizQuestion as the PK, QuizID as a FK back to the Quizzes table, and QuestionID as a FK back to the Questions table.
This model lets you:
Use questions standalone or in quizzes
Have as many or as few questions in a quiz as you want
Have as many or as few choices for questions as you want (or even multiple correct answers)
Use questions in several different quizzes
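To make the four tables concrete, here is a small sketch using Python's built-in sqlite3 module with an in-memory database. The table and key names follow the description above; the remaining column names (`question_text`, `answer_text`, `is_correct`), the sample rows, and the `quizzes_using` helper are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Questions (
    ID_Question INTEGER PRIMARY KEY,
    question_text TEXT NOT NULL,
    category TEXT
);
CREATE TABLE Answers (
    ID_Answer INTEGER PRIMARY KEY,
    QuestionID INTEGER NOT NULL REFERENCES Questions(ID_Question),
    answer_text TEXT NOT NULL,
    is_correct INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE Quizzes (
    ID_Quiz INTEGER PRIMARY KEY,
    description TEXT,
    category TEXT
);
CREATE TABLE QuizQuestions (
    ID_QuizQuestion INTEGER PRIMARY KEY,
    QuizID INTEGER NOT NULL REFERENCES Quizzes(ID_Quiz),
    QuestionID INTEGER NOT NULL REFERENCES Questions(ID_Question)
);
""")

conn.execute("INSERT INTO Questions VALUES (1, 'What is the capital of Finland?', 'geography')")
conn.executemany(
    "INSERT INTO Answers (QuestionID, answer_text, is_correct) VALUES (?, ?, ?)",
    [(1, "Stockholm", 0), (1, "Helsinki", 1), (1, "London", 0)],
)
conn.executemany(
    "INSERT INTO Quizzes VALUES (?, ?, ?)",
    [(1, "Capitals", "geography"), (2, "Nordic countries", "geography")],
)
# The same question is attached to two different quizzes.
conn.executemany("INSERT INTO QuizQuestions (QuizID, QuestionID) VALUES (?, ?)", [(1, 1), (2, 1)])

def quizzes_using(question_id):
    """Return the descriptions of all quizzes that include the given question."""
    rows = conn.execute(
        "SELECT q.description FROM Quizzes q "
        "JOIN QuizQuestions qq ON qq.QuizID = q.ID_Quiz "
        "WHERE qq.QuestionID = ? ORDER BY q.ID_Quiz",
        (question_id,),
    )
    return [row[0] for row in rows]

print(quizzes_using(1))
```

A question with no rows in QuizQuestions is simply a standalone question, which is how this layout covers both use cases.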

I recently completed an App Engine app for taking personality quizzes.
I would say go the super simple route and store everything about each quiz in a single Quiz entity. If you don't need to reuse questions between quizzes, don't need to search or in any other way access the structure of a quiz besides taking the quiz, you could simply do:
class Quiz(db.Model):
data = db.TextProperty(default=None)
Then data can be a JSON structure like:
data = {
    "title" : "Capitals quiz",
    "questions" : [
        {
            "text" : "What is the capital of Finland?",
            "options" : ["Stockholm", "Helsinki", "London"],
            "correct" : 1
        }
        ...
    ]
}
Things for which you want indexes you will want to leave out of this data structure. For example in my app I found I need to leave ID of the quiz creator outside the data so that I can make a data store query for all quizzes created by a certain person. Also I have creation date outside of the data, so that I can query for latest quizzes.
created = db.DateTimeProperty(auto_now_add=True)
You might have some other fields like this. Like I said this is a very simple way to store quizzes without needing to have multiple data store entities or queries for a quiz. However it has worked well in practice in my own personality tests app and is an increasingly popular way of storing data.
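As a rough illustration of the round trip, the sketch below serializes a quiz dict to a JSON string (the value that would be stored in the TextProperty) and reads it back. The `is_correct` helper and the sample quiz contents are assumptions for the example only.

```python
import json

quiz = {
    "title": "Capitals quiz",
    "questions": [
        {
            "text": "What is the capital of Finland?",
            "options": ["Stockholm", "Helsinki", "London"],
            "correct": 1,  # index into "options"
        }
    ],
}

# Serialize once when saving; this string is what the entity would hold.
data = json.dumps(quiz)

# Deserialize when a user takes the quiz.
loaded = json.loads(data)

def is_correct(quiz_dict, question_index, chosen_option):
    """Check a chosen option index against the stored correct index."""
    return quiz_dict["questions"][question_index]["correct"] == chosen_option

print(loaded["questions"][0]["options"][1])
print(is_correct(loaded, 0, 1))
```

Nothing inside the JSON string is visible to datastore queries, which is why fields such as the creator ID and creation date stay as separate properties.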

Here's a design that will account for having a many-to-many relationship between Quizzes and Questions. In other words, a Quiz can have many questions, and a Question can belong to many Quizzes. There is only one copy of a Question, so changes can be made to it that will then be reflected in each Quiz to which the Question belongs.
class Quiz(db.Model):
    # Data specific to your Quiz: number, name, times shown, etc
    questions = db.ListProperty(db.Key)

class Question(db.Model):
    question = db.StringProperty()
    choices = db.StringListProperty() # List of possible answers
    correct = db.IntegerProperty() # index of string in choices that is correct
The questions property of a Quiz entity holds the Key of each Question entity which is assigned to the Quiz. Look up by key is speedy, and it allows any one Question to be assigned to any number of individual Quizzes.

Have a table of questions, a table of quizzes and a mapping table between them. That will give you the most flexibility. This is simple enough that you wouldn't even necessarily need a whole relational database management system. I think people tend to forget that relations are pretty simple mathematical/logical concepts. An RDBMS just handles a lot of the messy bookkeeping for you.

Possible Data Schemes for Achievements on Google App Engine

I'm building a flash game website using Google App Engine. I'm going to put achievements on it, and am scratching my head on how exactly to store this data. I have a (non-google) user model. I need to know how to store each kind of achievement (name, description, picture), as well as link them with users who earn them. Also, I need to keep track of when they earned it, as well as their current progress for those who don't have one yet.
If anyone wants to give suggestions about detection or other achievement related task feel free.
EDIT:
Non-Google User Model:
class BeardUser(db.Model):
    email = db.StringProperty()
    username = db.StringProperty()
    password = db.StringProperty()
    settings = db.ReferenceProperty(UserSettings)
    is_admin = db.BooleanProperty()
    user_role = db.StringProperty(default="user")
    datetime = db.DateTimeProperty(auto_now_add=True)

Unless your users will be adding achievements dynamically (as in something like Kongregate where users can upload their own games, etc), your achievement list will be static. That means you can store the (name, description, picture) list in your main python file, or as an achievements module or something. You can reference the achievements in your dynamic data by name or id as specified in this list.
To indicate whether a given user has gotten an achievement, a simple way is to add a ListProperty to your player model that will hold the achievements that the player has gotten. By default this will be indexed so you can query for which players have gotten which achievements, etc.
If you also want to store what date/time the user has gotten the achievement, we're getting into an area where the built-in properties are less ideal. You could add another ListProperty with datetime properties corresponding to each achievement, but that's awkward: what we'd really like is a tuple or dict so that we can store the achievement id/name together with the time it was achieved, and make it easy to add additional properties related to each achievement in the future.
My usual approach for cases like this is to dump my ideal data structure into a BlobProperty, but that has some disadvantages and you may prefer a different approach.
Another approach is to have an achievement model separate from your user model, and a ListProperty full of references to all the achievements that user has gotten. This has the advantage of letting you index everything very nicely, but it will be an additional API CPU cost at runtime, especially if you have a lot of achievements to look up, and operations like deleting all of a user's achievements will be very expensive compared to the above.

Building on Brandon's answer here.
If you can, it would be MUCH faster to define all the achievements in the Python files, so they are in-memory and require no database fetching. This is also simpler to implement.
class StaticAchievement(object):
    """
    An achievement defined in a Python file.
    """
    by_name = {}   # name -> achievement
    by_index = []  # position -> achievement

    def __init__(self, name, description="", picture=None):
        if picture is None:
            picture = "static/default_achievement.png"
        self.name = name
        self.description = description
        self.picture = picture
        # Take the index before appending so that by_index[self.index] is self.
        self.index = len(StaticAchievement.by_index)
        StaticAchievement.by_name[name] = self
        StaticAchievement.by_index.append(self)

# This automatically adds an entry to the StaticAchievement.by_name dict.
# It also adds an entry to the StaticAchievement.by_index list.
StaticAchievement(
    name="tied your shoe",
    description="You successfully tied your shoes!",
    picture="static/shoes.png"
)
Then all you have to do is keep the ids for each player's achievements in a db.StringListProperty. When you have the player's object loaded, rendering the achievements requires no additional db lookups: you already have the ids, so you just have to look them up in StaticAchievement.by_name. This is simple to implement, and allows you to easily query which users have a given achievement, etc. Nothing else is required.
If you want additional data associated with the user's possession of an achievement (e.g. the date at which it was acquired) then you have a choice of approaches:
1: Store it in another ListProperty of the same length.
This retains the implementation simplicity and the indexability of the properties. However, this sort of solution is not to everyone's taste. If you need to make the use of this data less messy, just write a method on the player object like this:
def achievement_tuples(self):
    r = []
    for i in range(0, len(self.achievements)):
        r.append( (self.achievements[i], self.achievement_dates[i]) )
    return r
You could handle progress by maintaining a parallel ListProperty of integers, and incrementing those integers when the user makes progress.
As long as you can understand how the data is represented, you can easily hide that representation behind whatever methods you want - allowing you to have both the interface you want and the performance and indexing characteristics you want. But if you don't really need the indexing and don't like the lists, see option #2.
2: Store the additional data in a BlobProperty.
This requires you to serialize and deserialize the data, and you give up the nice querying of the listproperties, but maybe you will be happier if you hate the concept of the parallel lists. This is what people tend to do when they really want the Python way of doing things, as distinct from the App Engine way.
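A minimal sketch of the serialize/deserialize step, assuming JSON as the encoding (pickle would work similarly). The record layout, with an `earned` datetime and a `progress` integer per achievement name, and the two helper names are illustrative assumptions; the resulting bytes are what would be stored in the BlobProperty.

```python
import json
from datetime import datetime

def dump_achievements(records):
    """Encode {name: {"earned": datetime or None, "progress": int}} as bytes."""
    plain = {
        name: {
            "earned": rec["earned"].isoformat() if rec["earned"] else None,
            "progress": rec["progress"],
        }
        for name, rec in records.items()
    }
    return json.dumps(plain, sort_keys=True).encode("utf-8")

def load_achievements(blob):
    """Decode bytes produced by dump_achievements back into Python objects."""
    plain = json.loads(blob.decode("utf-8"))
    return {
        name: {
            "earned": datetime.fromisoformat(rec["earned"]) if rec["earned"] else None,
            "progress": rec["progress"],
        }
        for name, rec in plain.items()
    }

records = {
    "tied your shoe": {"earned": datetime(2010, 5, 1, 9, 30), "progress": 100},
    "ran a mile": {"earned": None, "progress": 40},  # not earned yet
}

blob = dump_achievements(records)
print(load_achievements(blob) == records)
```

The whole structure travels as one opaque value, so adding a field later only changes these two functions, at the price of not being able to query on anything inside it.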
3: Store the additional data in the db
(e.g. a PlayersAchievement object containing a StringProperty of the Achievement id, a DateProperty, and a UserProperty; and on the player an achievements ListProperty full of references to PlayersAchievement objects)
Since we expect players to potentially get a large number of achievements, the overhead for this will get bad fast. Getting around it will get very complex very quickly: memcache, storing intermediate data like blobs of prerendered HTML or serialized lists of tuples, setting up tasks, etc. This is also the kind of stuff you will have to do if you want the achievement definitions themselves to be modifiable/stored in the DB.
