FRIENDS Table Datastore Design + App Engine Python - python

In my application, we need a FRIENDS relationship table in the datastore. A quick solution I've thought of would be this:
user = db.ReferenceProperty(User, required=True, collection_name='user')
friend = db.ReferenceProperty(User, required=True, collection_name='friends')
But what would happen when the friend list grows to a huge number, say a few thousand or more? Would this be too inefficient?
Performance is always a priority for us. This matters all the more because several more relationships will follow the same design.
Please advise on the best approach to designing a FRIENDS relationship table using the datastore in the App Engine Python environment.
EDIT
Besides the FRIENDS relationship, a FOLLOWER relationship will be created as well. I expect all of these relationships to be queried very often, given the social-media orientation of my application.
For example, if I follow some users, I will get updates in a news feed about what they are doing. The volume of activity will grow over time. As for how many users, I can't say yet as we haven't gone live, but I foresee millions of users as we go on.
Hopefully this helps in giving more specific advice. Or is there an alternative to this approach?

Your FRIENDS model (and presumably also your FOLLOWERS model) should scale well. The tricky part in your system is actually aggregating the content from all of a user's friends and followees.
Querying for a list of a user's friends is O(N), where N is the number of friends, due to the table you've described in your post. However, each of those queries requires another O(N) operation to retrieve content that the friend has shared. This leads to O(N^2) each time a user wants to see recent content. This particular query is bad for two reasons:
An O(N^2) operation isn't what you want to see in your core algorithm when designing a system for millions of users.
App Engine tends to limit these kinds of queries. Specifically, the IN keyword you'd need to use in order to grab the list of shared items won't work for more than 30 friends.
For this particular problem, I'd recommend creating another table that links each user to each piece of shared content. Something like this:
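class SharedItems(db.Model):
    # Two references to User need distinct collection_name values,
    # otherwise db raises DuplicatePropertyError.
    user = db.ReferenceProperty(User, required=True, collection_name='stream_items')  # logged-in user
    # 'from' is a reserved word in Python, so the field is named from_user
    from_user = db.ReferenceProperty(User, required=True, collection_name='shared_by_me')  # who shared it
    item = db.ReferenceProperty(Item, required=True)  # the item itself
    posted = db.DateTimeProperty()  # when it was shared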
When it comes time to render the stream of updates, you need an O(N) query (N is the number of items you want to display) to look up all the items shared with the user (ordered by date descending). Keep N small to keep this as fast as possible.
Sharing an item requires creating O(N) SharedItems where N is the number of friends and followers the poster has. If this number is too large to handle in a single request, shard it out to a task queue or backend.
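The write-time fan-out described above can be sketched in plain Python. This is only an in-memory illustration, not datastore code: the names followers, shared_items, share and stream are hypothetical, and the dictionaries stand in for the Friends/Followers and SharedItems tables.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# In-memory stand-ins for the datastore (hypothetical, for illustration only).
followers = defaultdict(set)      # poster -> users who should see their items
shared_items = defaultdict(list)  # recipient -> [(posted, poster, item), ...]


def share(poster, item, posted):
    """Fan out on write: one SharedItems-like row per follower, O(followers)."""
    for recipient in followers[poster]:
        shared_items[recipient].append((posted, poster, item))


def stream(user, limit=20):
    """Read path: the newest `limit` rows for one user, no per-friend queries."""
    rows = sorted(shared_items[user], key=lambda row: row[0], reverse=True)
    return rows[:limit]


followers["alice"] = {"bob", "carol"}
followers["dave"] = {"bob"}
start = datetime(2012, 1, 1)
share("alice", "photo-1", start)
share("dave", "link-7", start + timedelta(minutes=5))
```

The cost moves from the reader to the writer: share does work proportional to the poster's audience, which is the part you would hand to a task queue when the audience is large.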

A ListProperty is a great way to get cheap, simple indexing in GAE.
But as you have correctly identified, there are a few limitations.
The number of index entries for a whole entity is limited (I think currently 5000). Each ListProperty value requires its own index entry, so the list basically has to stay under 4999 values.
Serialisation of such a large ListProperty is expensive!
Bringing back a 2 MB entity is slow and will cost CPU.
If you expect a large list, don't do it.
The alternative is to create a JOIN table that models the relationship:
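class Friends(db.Model):
    # distinct collection_name values are required for two references to User
    user = db.ReferenceProperty(User, required=True, collection_name='friend_links')  # the user who owns the list
    friend = db.ReferenceProperty(User, required=True, collection_name='friended_by_links')  # the user's friend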
It's just an entity with two keys.
This allows simple querying to find all friends of a user:
SELECT * FROM Friends WHERE user = :me
And to find all users who have me as a friend:
SELECT * FROM Friends WHERE friend = :me
Since each result holds a key, you can do a bulk get(keylist) to fetch the actual friends' details.
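To make the two lookups and the bulk fetch concrete, here is a minimal in-memory sketch. It does not use the datastore API: friend_edges, users_by_key, friends_of, friended_by and bulk_get are hypothetical names, with each tuple standing in for one Friends entity.

```python
# Hypothetical in-memory stand-ins: each tuple is one Friends entity (user, friend).
friend_edges = [("alice", "bob"), ("alice", "carol"), ("bob", "alice")]
users_by_key = {
    "alice": {"name": "Alice"},
    "bob": {"name": "Bob"},
    "carol": {"name": "Carol"},
}


def friends_of(me):
    """Like: SELECT * FROM Friends WHERE user = :me (returns friend keys)."""
    return [friend for user, friend in friend_edges if user == me]


def friended_by(me):
    """Like: SELECT * FROM Friends WHERE friend = :me (returns user keys)."""
    return [user for user, friend in friend_edges if friend == me]


def bulk_get(keys):
    """One batched lookup for all keys instead of one fetch per key."""
    return [users_by_key[key] for key in keys]


alice_friends = bulk_get(friends_of("alice"))
```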

Related

How to scale database efficiently in Google App Engine?

I'm developing my first web application using Google App Engine Python SDK.
I know GAE handles scaling but I just want to know if I'm thinking about database design the right way.
For instance, if I have a User class that stores all usernames, hashed pw's etc., I'd imagine that once I have many users, reading from this User class would be slow.
Instead of having one giant User database, would I split it up so I have a UserA class, which stores all user info for usernames that begin with A? So I'd have a UserA class, UserB class, etc. Would this make reading/writing for users more efficient?
If I'm selling clothes on my app, instead of having one Clothing class, would I split it up by category so I have a ShirtsClothing class that only stores shirts, a PantsClothing class that stores only pants, etc?
Am I on the right track here?
"I'd imagine that once I have many users, reading from this User class would be slow."
No, reading a certain number of entries takes the same time no matter how many other unread entries are around, few or a bazillion of them.
Rather, if on a given query you only need a subset of the entities' fields, consider projection queries.
"Sharding" (e.g by user initial, clothing category, and so forth) is typically not going to improve your app's scalability. One exception might perhaps come if you need queries based on more than one inequality: the datastore natively supports inequality constraints on only one field per query, and perhaps some sharding might help alleviate that. But, just like all ilks of denormalization, that's strictly application-dependent: what queries will you need to perform, with what performance constraints/goals.
For some good tips on scalability practices, consider Google's own essays on the subject.

Structuring NDB models to make an "antijoin" query possible

I want to have a recommendation API for my app where a user can get a suggestion for an object they haven't yet seen, and I'm having trouble figuring out how I'm supposed to structure my data to make such a query efficient.
Here's an example using books. Suppose this is what I have as my models, the most normalized way possible:
class User(ndb.Model):
    name = ndb.StringProperty()
class Book(ndb.Model):
    title = ndb.StringProperty()
class Review(ndb.Model):
    user = ndb.KeyProperty(User)
    book = ndb.KeyProperty(Book)
    stars = ndb.IntegerProperty()
    text = ndb.TextProperty()
Now given a user, I want to retrieve a book that the user hasn't reviewed and this seems basically impossible to do efficiently and at scale (e.g. 50k users, 100k books).
I've read around and I realize that I should denormalize my data somehow, but for the life of me, I can't figure out a good way to do it. I've thought of putting Review as a StructuredProperty inside of Book, but I don't think that really buys me very much, and it means that I'll be limited by the number of reviews I can add to a book (because of the size limit for entries).
The other things I've seen mentioned a lot when other people asked similar questions are ancestors and ComputedProperty, but I don't really see how they help me here either.
Surely it's not actually impossible, and I just have a weak understanding of the best practices, right?
A useful de-normalization might be to add to User the list of books they've reviewed:
class User(ndb.Model):
    name = ndb.StringProperty()
    # kind is passed by keyword: a positional string would be taken as the property name
    seen = ndb.KeyProperty(kind='Book', repeated=True)
and to 'Book' an overall "score" whereby you'll want to order queries:
class Book(ndb.Model):
    title = ndb.StringProperty()
    score = ndb.IntegerProperty()
as usual the cost of de-normalization comes when writing -- in addition to creating a new review you'll also need to update the User and Book entities (and you may need a transaction, thus, entity groups, if e.g several users may be reviewing a book at the same time, but I'm skipping that part:-).
The advantage is that, when needing to propose a new book to a given User, you can query for Book (keys-only, sorted by score), with a cursor (or just a loop on the query) to "page through" the query's results, and just reject keys that, as you can check in-memory, are already in the given User's seen property.
Upon getting the User entity for the purpose, you can turn the seen into a set, so the checks will be very fast. This assumes that a user won't review more than a few thousand books, so everything needed should nicely fit in memory...
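The in-memory filtering step can be sketched as follows. This is a plain-Python illustration under stated assumptions: recommend is a hypothetical helper, books_by_score stands in for the keys-only query results already ordered by score, and seen_keys for the User's seen property.

```python
from itertools import islice


def recommend(books_by_score, seen_keys, limit=1):
    """Return up to `limit` book keys, best score first, skipping seen ones.

    books_by_score: iterable of keys already sorted by score (the query results).
    seen_keys: keys the user has already reviewed (the User's `seen` list).
    """
    seen = set(seen_keys)  # set membership keeps each check O(1)
    unseen = (key for key in books_by_score if key not in seen)
    # islice stops pulling from the results as soon as enough picks are found
    return list(islice(unseen, limit))


ranked = ["b3", "b1", "b4", "b2"]  # hypothetical keys, best score first
first_pick = recommend(ranked, ["b3", "b4"])
```

Because the generator is consumed lazily, only as many query results are read as are needed to find the requested number of unseen books.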

how to query by sum in ndb

I'm trying to build some kind of KPI into my site and am struggling with how to retrieve the data.
For example, let's say I'm building a blog with a model of:
class MyPost(ndb.Model):
    Author = ndb.KeyProperty(MyUser, required = True)
    when = TZDateTimeProperty(required = True)
    status = ndb.IntegerProperty(default = 1) # 1=draft, 2=published
    text = ndb.TextProperty()
and I want to build a query that lists my top authors, giving a result like this (preferably sorted):
{'Jack': 10, 'Jane': 8, 'Joe': 0}
I can think of two ways:
query().fetch() all items and count them manually
this is very inefficient but the most flexible
for author in Users: result[author] = query(...).count()
so-so efficiency, and requires knowing my indexes in advance (it would not work if I wanted to query by "author's favorite pet")
Which one is preferable?
What other methods would you recommend?
I'd recommend de-normalizing the MyUser model, that is, introducing redundancy, by giving MyUser an IntegerProperty, say numposts, that redundantly keeps track of how many MyPost entities that user has authored. The need to de-normalize is frequent in NoSQL data stores.
The price you pay for this modest de-normalization is that adding a new post requires more work, since you also need to increment the author's numposts when that happens. However, more often than not, a data store is "read mostly" -- addition of new entities is comparatively rare compared to querying existing ones. The purpose of de-normalization is to make the latter activity vastly more efficient, for important queries, at a modest cost to the former activity.
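A minimal sketch of that trade-off, in plain Python rather than ndb: numposts here is a hypothetical in-memory counter standing in for the IntegerProperty on MyUser, and add_post and top_authors are illustrative helpers. In the datastore, the two writes in add_post would likely need to happen in one transaction.

```python
from collections import Counter

# Hypothetical in-memory stand-ins for MyPost entities and MyUser.numposts.
posts = []
numposts = Counter({"Jack": 0, "Jane": 0, "Joe": 0})


def add_post(author, text):
    """Write path: store the post and bump the redundant counter."""
    posts.append((author, text))
    numposts[author] += 1


def top_authors(n=10):
    """Read path: rank authors from the counters without touching posts."""
    return numposts.most_common(n)


for _ in range(3):
    add_post("Jack", "hello")
for _ in range(2):
    add_post("Jane", "hi")
```

The ranking never scans posts, which is the point of keeping the count on the author.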

Google app engine, query multiple entities

I have following 2 entities.
class Photo(db.Model):
    name=db.StringProperty()
    registerdate=db.DateTimeProperty()
    iso=db.StringProperty()
    exposure=db.StringProperty()
class PhotoRatings(db.Model):
    ratings=db.IntegerProperty()
I need to do the following.
Get all the photos (Photo) with iso=800 sorted by ratings (PhotoRatings).
I cannot add ratings inside Photo because ratings change all the time and I would have to write the entire Photo entity every single time. This will cost me more time and money, and the application will take a performance hit.
I read this,
https://developers.google.com/appengine/articles/modeling
But could not get much information from it.
EDIT: I want to avoid fetching too many items and perform the match manually. I need fast and efficient solution.
You're trying to do relational database queries with an explicitly non-relational Datastore.
As you might imagine, this presents problems. If you want the Datastore to sort the results for you, it has to be able to index on what you want to sort by. Indices cannot span multiple entity types, so you can't have an index for Photos that is ordered by PhotoRatings.
Sorry.
Consider, however - which will happen more often? Querying for this ordering of photos, or someone rating a photo? Chances are, you'll have far more views than actions, so storing the rating as part of the Photo entity might not be as big a hit as you fear.
If you look at the billing docs you'll notice that an Entity write is charged per changed number of properties.
So what you are trying to do will not reduce write cost, but will definitely increase read cost as you'll be reading double the number of entities.
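As a rough illustration of keeping the rating on the photo itself, here is an in-memory sketch. It is not datastore code: photos, rate and names_by_rating are hypothetical names, and in the datastore the equivalent query (filter on iso, order by rating descending) would need a composite index on those two properties.

```python
# Hypothetical in-memory stand-in for Photo entities that carry their own rating.
photos = {
    "dawn": {"iso": "800", "rating": 4},
    "dusk": {"iso": "800", "rating": 9},
    "noon": {"iso": "100", "rating": 7},
}


def rate(name, delta):
    """A rating change updates one field on the photo itself."""
    photos[name]["rating"] += delta


def names_by_rating(iso):
    """Equivalent of: filter on iso, order by rating descending."""
    matches = [(p["rating"], name) for name, p in photos.items() if p["iso"] == iso]
    return [name for _, name in sorted(matches, reverse=True)]


before = names_by_rating("800")
rate("dawn", 6)
after = names_by_rating("800")
```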

Possible Data Schemes for Achievements on Google App Engine

I'm building a flash game website using Google App Engine. I'm going to put achievements on it, and am scratching my head on how exactly to store this data. I have a (non-google) user model. I need to know how to store each kind of achievement (name, description, picture), as well as link them with users who earn them. Also, I need to keep track of when they earned it, as well as their current progress for those who don't have one yet.
If anyone wants to give suggestions about detection or other achievement related task feel free.
EDIT:
Non-Google User Model:
class BeardUser(db.Model):
    email = db.StringProperty()
    username = db.StringProperty()
    password = db.StringProperty()
    settings = db.ReferenceProperty(UserSettings)
    is_admin = db.BooleanProperty()
    user_role = db.StringProperty(default="user")
    datetime = db.DateTimeProperty(auto_now_add=True)
Unless your users will be adding achievements dynamically (as in something like Kongregate where users can upload their own games, etc), your achievement list will be static. That means you can store the (name, description, picture) list in your main python file, or as an achievements module or something. You can reference the achievements in your dynamic data by name or id as specified in this list.
To indicate whether a given user has gotten an achievement, a simple way is to add a ListProperty to your player model that will hold the achievements that the player has gotten. By default this will be indexed so you can query for which players have gotten which achievements, etc.
If you also want to store what date/time the user has gotten the achievement, we're getting into an area where the built-in properties are less ideal. You could add another ListProperty with datetime properties corresponding to each achievement, but that's awkward: what we'd really like is a tuple or dict so that we can store the achievement id/name together with the time it was achieved, and make it easy to add additional properties related to each achievement in the future.
My usual approach for cases like this is to dump my ideal data structure into a BlobProperty, but that has some disadvantages and you may prefer a different approach.
Another approach is to have an achievement model separate from your user model, and a ListProperty full of references (db.Key values) to all the achievements that user has gotten. This has the advantage of letting you index everything very nicely, but it adds API CPU cost at runtime, especially if you have a lot of achievements to look up, and operations like deleting all of a user's achievements will be very expensive compared to the above.
Building on Brandon's answer here.
If you can, it would be MUCH faster to define all the achievements in the Python files, so they are in-memory and require no database fetching. This is also simpler to implement.
class StaticAchievement(object):
    """
    An achievement defined in a Python file.
    """
    by_name = {}
    by_index = []
    def __init__(self, name, description="", picture=None):
        if picture is None:
            picture = "static/default_achievement.png"
        # keep the fields on the instance so they can be rendered later
        self.name = name
        self.description = description
        self.picture = picture
        # position of this achievement in by_index (0-based)
        self.index = len(StaticAchievement.by_index)
        StaticAchievement.by_name[name] = self
        StaticAchievement.by_index.append(self)
# This automatically adds an entry to the StaticAchievement.by_name dict.
# It also adds an entry to the StaticAchievement.by_index list.
StaticAchievement(
    name="tied your shoe",
    description="You successfully tied your shoes!",
    picture="static/shoes.png"
)
Then all you have to do is keep the ids for each player's achievements in a db.StringListProperty. When you have the player's object loaded, rendering the achievements requires no additional db lookups: you already have the ids, so you just look them up in StaticAchievement.by_name. This is simple to implement, and allows you to easily query which users have a given achievement, etc. Nothing else is required.
If you want additional data associated with the user's possession of an achievement (e.g. the date at which it was acquired) then you have a choice of approaches:
1: Store it in another ListProperty of the same length.
This retains the implementation simplicity and the indexability of the properties. However, this sort of solution is not to everyone's taste. If you need to make the use of this data less messy, just write a method on the player object like this:
def achievement_tuples(self):
    r = []
    for i in range(0, len(self.achievements)):
        r.append( (self.achievements[i], self.achievement_dates[i]) )
    return r
You could handle progress by maintaining a parallel ListProperty of integers, and incrementing those integers when the user makes progress.
As long as you can understand how the data is represented, you can easily hide that representation behind whatever methods you want - allowing you to have both the interface you want and the performance and indexing characteristics you want. But if you don't really need the indexing and don't like the lists, see option #2.
2: Store the additional data in a BlobProperty.
This requires you to serialize and deserialize the data, and you give up the nice querying of the listproperties, but maybe you will be happier if you hate the concept of the parallel lists. This is what people tend to do when they really want the Python way of doing things, as distinct from the App Engine way.
3: Store the additional data in the db
(e.g. a PlayersAchievement object containing a StringProperty for the Achievement id, a DateProperty, and a UserProperty; and on the player an achievements ListProperty full of references to PlayersAchievement objects)
Since we expect players to potentially get a large number of achievements, the overhead for this will get bad fast. Getting around it will get very complex very quickly: memcache, storing intermediate data like blobs of prerendered HTML or serialized lists of tuples, setting up tasks, etc. This is also the kind of stuff you will have to do if you want the achievement definitions themselves to be modifiable/stored in the DB.
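For option 2 above, the serialize/deserialize round trip might look like the sketch below. It assumes JSON as the blob format and a dict of achievement name to earned time as the data structure; neither choice comes from the answers, and dump_achievements, load_achievements and STAMP are hypothetical names. The resulting bytes are what you would store in the BlobProperty.

```python
import json
from datetime import datetime

STAMP = "%Y-%m-%dT%H:%M:%S"  # assumed timestamp format for the blob


def dump_achievements(earned):
    """Serialize {achievement_name: datetime} into bytes for a blob field."""
    as_text = {name: when.strftime(STAMP) for name, when in earned.items()}
    return json.dumps(as_text, sort_keys=True).encode("utf-8")


def load_achievements(blob):
    """Inverse of dump_achievements: bytes back to {name: datetime}."""
    as_text = json.loads(blob.decode("utf-8"))
    return {name: datetime.strptime(when, STAMP) for name, when in as_text.items()}


earned = {"tied your shoe": datetime(2011, 5, 1, 12, 30, 0)}
blob = dump_achievements(earned)
restored = load_achievements(blob)
```

As the answer notes, the price is that the datastore can no longer query on what is inside the blob.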
