How to scale a database efficiently in Google App Engine? - python

I'm developing my first web application using Google App Engine Python SDK.
I know GAE handles scaling but I just want to know if I'm thinking about database design the right way.
For instance, if I have a User class that stores all usernames, hashed passwords, etc., I'd imagine that once I have many users, reading from this User class would be slow.
Instead of having one giant User database, would I split it up so I have a UserA class, which stores all user info for usernames that begin with A? So I'd have a UserA class, UserB class, etc. Would this make reading/writing for users more efficient?
If I'm selling clothes on my app, instead of having one Clothing class, would I split it up by category so I have a ShirtsClothing class that only stores shirts, a PantsClothing class that stores only pants, etc?
Am I on the right track here?

I'd imagine that once I have many users, reading from this User class would be slow.
No: reading a given number of entities takes the same time no matter how many other, unread entities are around, whether a few or a bazillion.
Rather, if on a given query you only need a subset of the entities' fields, consider projection queries.
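For instance, a minimal sketch with the db API (the model and property names are placeholders):

from google.appengine.ext import db

class User(db.Model):
    username = db.StringProperty(required=True)
    pw_hash = db.StringProperty()

# Projection query: fetch only the username property instead of whole
# entities. Projected properties must be indexed.
usernames = db.Query(User, projection=('username',)).fetch(100)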
"Sharding" (e.g by user initial, clothing category, and so forth) is typically not going to improve your app's scalability. One exception might perhaps come if you need queries based on more than one inequality: the datastore natively supports inequality constraints on only one field per query, and perhaps some sharding might help alleviate that. But, just like all ilks of denormalization, that's strictly application-dependent: what queries will you need to perform, with what performance constraints/goals.
For some good tips on scalability practices, consider Google's own essays on the subject.

Related

How can I implement shadow editing (or revisions) across a model with relationships (django)?

While this is django+postgresql, the answer could be generic sql or from a "Databases for Dummies" book.
We have a database with several interrelated models (one to one, one to many, and many to many fields). We'd like to allow a user to shadow-edit the database, and only publish once he's happy with the changes.
For a single model, I could use something like django-reversion, and I could handle the relationships by hand in a hacky sort of way. But this would have several side effects:
Models not under django's control could change, which would update the data immediately (no shadow copy)
Since external relationships are being stored, things will get strange if there are a lot of edits to them.
Huge amount of work 'catching' CRUD operations and routing them to published or draft entries (if particular user is editing)
Need to fix all pks on relations when publishing (more hack-titude)
What I'd really like is something that would do this:
Allow editing of many related tables at once, over many REST CRUD calls, and only updating after 'publishing'
Allow rolling back to previous version (versioning)
Any ideas?

How to instruct SQLAlchemy ORM to execute multiple queries in parallel when loading relationships?

I am using SQLAlchemy's ORM. I have a model that has multiple many-to-many relationships:
User
User <--MxN--> Organization
User <--MxN--> School
User <--MxN--> Credentials
I am implementing these using association tables, so there are also User_to_Organization, User_to_School and User_to_Credentials tables that I don't directly use.
Now, when I attempt to load a single User (using its PK identifier) and its relationships (and related models) using joined eager loading, I get horrible performance (15+ seconds). I assume this is due to this issue:
When multiple levels of depth are used with joined or subquery loading, loading collections-within-collections will multiply the total number of rows fetched in a cartesian fashion. Both forms of eager loading always join from the original parent class.
If I introduce another level or two to the hierarchy:
Organization <--1xN--> Project
School <--1xN--> Course
Project <--MxN--> Credentials
Course <--MxN--> Credentials
The query takes 50+ seconds to complete, even though the total number of records in each table is fairly small.
Using lazy loading, I am required to manually load each relationship, and there are multiple round trips to the server.
e.g.
Operations, executed serially as queries:
Get user
Get user's Organizations
Get user's Schools
Get user's credentials
For each Organization, get its Projects
For each School, get its Courses
For each Project, get its Credentials
For each Course, get its Credentials
Still, it all finishes in less than 200ms.
I was wondering if there is any way to indeed use lazy loading, but perform the relationship-loading queries in parallel, for example using the concurrent.futures module, asyncio, or gevent.
e.g.
Step 1 (in parallel):
Get user
Get user's Organizations
Get user's Schools
Get user's credentials
Step 2 (in parallel):
For each Organization, get its Projects
For each School, get its Courses
Step 3 (in parallel):
For each Project, get its Credentials
For each Course, get its Credentials
Actually, at this point, a subquery-type load could also work, that is, return Organization and OrganizationID/Project/Credentials in two separate queries:
e.g.
Step 1 (in parallel):
Get user
Get user's Organizations
Get user's Schools
Get user's credentials
Step 2 (in parallel):
Get Organizations
Get Schools
Get the Organizations' Projects, join with Credentials
Get the Schools' Courses, join with Credentials
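For what it's worth, a rough sketch of how Step 1 might be parallelized with concurrent.futures; it assumes a sessionmaker named Session, relationship attributes named organizations, schools and credentials, and that each worker uses its own session, since SQLAlchemy sessions are not thread-safe:

from concurrent.futures import ThreadPoolExecutor

def load_related(user_id, attr):
    # One session per thread; a Session must never be shared across threads.
    session = Session()
    try:
        user = session.query(User).get(user_id)
        return list(getattr(user, attr))  # triggers the lazy load inside this session
    finally:
        session.close()

with ThreadPoolExecutor(max_workers=3) as pool:
    futures = {attr: pool.submit(load_related, user_id, attr)
               for attr in ('organizations', 'schools', 'credentials')}
    related = {attr: future.result() for attr, future in futures.items()}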
The first thing you're going to want to do is check to see what queries are actually being executed on the db. I wouldn't assume that SQLAlchemy is doing what you expect unless you're very familiar with it. You can use echo=True on your engine configuration or look at some db logs (not sure how to do that with mysql).
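For example, a minimal sketch (the connection URL is a placeholder):

from sqlalchemy import create_engine

# echo=True makes SQLAlchemy log every SQL statement it emits, so you can
# see exactly which queries your loading strategy produces.
engine = create_engine('mysql://user:password@localhost/mydb', echo=True)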
You've mentioned that you're using different loading strategies, so I guess you've read through the docs on that (http://docs.sqlalchemy.org/en/latest/orm/loading_relationships.html). For what you're doing, I'd probably recommend subquery load, but it totally depends on the number of rows/columns you're dealing with. In my experience it's a good general starting point though.
One thing to note: you might need to do something like:
from sqlalchemy.orm import subqueryload

db.query(Thing).options(subqueryload('A').subqueryload('B')).filter(Thing.id == x).first()
with filter(...).first() rather than get(), as the latter won't re-execute queries according to your loading strategy if the primary object is already in the identity map.
Finally, I don't know your data - but those numbers sound pretty abysmal for anything short of a huge data set. Check that you have the correct indexes specified on all your tables.
You may have already been through all of this, but based on the information you've provided, it sounds like you need to do more work to narrow down your issue. Is it the db schema, or is it the queries SQLA is executing?
Either way, I'd say, "no" to running multiple queries on different connections. Any attempt to do that could result in inconsistent data coming back to your app, and if you think you've got issues now..... :-)
MySQL has no parallelism in a single connection. For the ORM to do such would require multiple connections to MySQL. Generally, the overhead of trying to do such is "not worth it".
Getting a user, his Organizations, Schools, etc., can all be done (in MySQL) via a single query:
SELECT user, organization, ...
FROM Users
JOIN Organizations ON ...
etc.
This is significantly more efficient than
SELECT user FROM ...;
SELECT organization ... WHERE user = ...;
etc.
(This is not "parallelism".)
Or maybe your "steps" are not quite 'right'?...
SELECT user, organization, project
FROM Users
JOIN Organizations ...
JOIN Projects ...
That gets, in a single step, all users, together with all their organizations and projects.
But is a "user" associated with a "project"? If not, then this is the wrong approach.
If the ORM is not providing a mechanism to generate queries like those, then it is "getting in the way".
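For the record, SQLAlchemy's ORM can generate joins like the one above; a sketch using the relationship attributes from the question (1.x query style):

# Users joined to their Organizations in a single round trip.
rows = (session.query(User, Organization)
        .join(User.organizations)
        .filter(User.id == user_id)
        .all())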

The maximum number of objects that can be instantiated with a Django model?

I wrote an app to record user interactions with the website search box; the query string is saved as an object of the model SearchQuery. Whenever a user enters some data in the search box, I save the search query and some info related to it in the database.
The idea is to capture search trends. The fields in my database model are (sketched in code after this list):
A Character Field (max_length=30)
A PositiveIntegerField
A BooleanField
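In Django terms, a minimal sketch of such a model (the field names are placeholders of my own choosing):

from django.db import models

class SearchQuery(models.Model):
    query = models.CharField(max_length=30)       # the search string
    result_count = models.PositiveIntegerField()  # e.g. how many results it returned
    had_results = models.BooleanField(default=False)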
My questions are:
How many objects can be instantiated from the model SearchQuery? Is there a limit on their number?
As the objects are not related (no db relationships), should I use MongoDB or some kind of NoSQL store for performance?
Is this a good design or should I do some more work to make it efficient?
Django version 1.6.5
Python version 2.7
How many objects can be instantiated from the model SearchQuery? Is there a limit on their number?
As many as your chosen database can handle; probably in the millions. If you are concerned, you can use a scheduler to delete older queries once they are no longer useful.
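For instance, a periodic cleanup might look like this; it assumes you add a created = models.DateTimeField(auto_now_add=True) field to the model:

from datetime import timedelta
from django.utils import timezone

# Delete search queries older than 90 days; run this from a cron job
# or a scheduled management command.
cutoff = timezone.now() - timedelta(days=90)
SearchQuery.objects.filter(created__lt=cutoff).delete()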
As the objects are not related (no db relationships), should I use MongoDB or some kind of NoSQL store for performance?
You could, but it's unlikely to give you much (if any) efficiency gain. Because you are doing frequent writes and (presumably) infrequent reads, you're unlikely to hit the database very hard at all.
Is this a good design or should I do some more work to make it efficient?
There are probably two recommendations I'd make.
a. If you are going to be doing frequent reads on the Search log, look at using multiple databases. One for your log, and one for everything else.
b. Consider just using a regular log file for this information. Again, you will probably only examine this data infrequently, so there are strong arguments for piping it into a log file, probably CSV-like, to make analysis of it easier (a sketch follows this list).
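A sketch of that log-file approach (the path and variable names are placeholders; on Python 2.7 the csv module wants the file opened in binary mode):

import csv
import datetime

# Append each search event as one CSV row instead of doing a database write.
with open('/var/log/myapp/search_queries.csv', 'ab') as f:
    csv.writer(f).writerow([
        datetime.datetime.utcnow().isoformat(),
        query_string,   # the CharField value
        result_count,   # the PositiveIntegerField value
        had_results,    # the BooleanField value
    ])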

AppEngine model structure for user/follower relations

I have users who have "followers". I need to be able to navigate up and down the tree of users/followers. If I use ancestor relations, I'll eventually hit AppEngine's 1MB limit on entity size once a user has many followers.
What's the best way to structure this data on AppEngine?
You cannot use ancestor relations, for the simple reason that your use case allows circular references (I follow you, you follow me).
The solution depends on your expected usage patterns. You can choose between two options:
(A) In each user entity, store a list of IDs of the other users that this user is following.
(B) Create a separate entity that has two properties: "User" and "Follower". Each such entity will represent a single "connection" between users.
While the first option seems simpler, you may run into the exploding-indexes problem. Besides, it may turn out to be the more expensive solution, as each change in user relationships requires rewriting the whole user entity and updating all of its other indexes. The second solution does not have these drawbacks, but may require a little extra code.
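A minimal sketch of option (B) with the db API (the entity and property names are my own):

from google.appengine.ext import db

class Connection(db.Model):
    user = db.ReferenceProperty(User, collection_name='follower_connections')
    follower = db.ReferenceProperty(User, collection_name='following_connections')

# All followers of some_user:
followers = Connection.all().filter('user =', some_user).fetch(100)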

FRIENDS Table Datastore Design + App Engine Python

In my application, we need to develop a FRIENDS relationship table in the datastore, and of course a quick solution I've thought of would be this:
class Friends(db.Model):
    user = db.ReferenceProperty(User, required=True, collection_name='user')
    friend = db.ReferenceProperty(User, required=True, collection_name='friends')
But what would happen when the friend list grows to a huge number, say a few thousand or more? Will this be too inefficient?
Performance is always a priority for us. This matters a lot, as we will have a few more relationships following this same design.
Please advise on the best approach to designing a FRIENDS relationship table with the datastore in the App Engine Python environment.
EDIT
Other than the FRIENDS relationship, a FOLLOWER relationship will be created as well. And I believe all these relationships will be queried very often, given how social-media oriented my application tends to be.
For example, if I follow some users, I will get updates as a news feed on what they are doing, etc., and that activity will grow over time. As for how many users, I can't answer yet as we haven't gone live, but I foresee millions of users as we go on.
Hopefully this helps you give more specific advice. Or is there an alternative to this approach?
Your FRIENDS model (and presumably also your FOLLOWERS model) should scale well. The tricky part in your system is actually aggregating the content from all of a user's friends and followees.
Querying for the list of a user's friends is O(N), where N is the number of friends, due to the table you've described in your post. However, each of those N results requires another O(N) operation to retrieve the content that friend has shared. This leads to O(N^2) work each time a user wants to see recent content. This particular query is bad for two reasons:
An O(N^2) operation isn't what you want to see in your core algorithm when designing a system for millions of users.
App Engine tends to limit these kinds of queries. Specifically, the IN keyword you'd need to use in order to grab the list of shared items won't work for more than 30 friends.
For this particular problem, I'd recommend creating another table that links each user to each piece of shared content. Something like this:
class SharedItems(db.Model):
    user = db.ReferenceProperty(User, required=True, collection_name='shared_with_me')  # logged-in user
    sharer = db.ReferenceProperty(User, required=True, collection_name='shared_by_me')  # who shared it ('from' is a reserved word in Python)
    item = db.ReferenceProperty(Item, required=True)  # the item itself
    posted = db.DateTimeProperty()  # when it was shared
When it comes time to render the stream of updates, you need an O(N) query (N is the number of items you want to display) to look up all the items shared with the user (ordered by date descending). Keep N small to keep this as fast as possible.
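Using the SharedItems model above, the read side might look like this sketch; current_user is assumed to be a User instance, and the query needs a composite index on (user, -posted):

# The 20 most recent items shared with the user, newest first.
recent = (SharedItems.all()
          .filter('user =', current_user)
          .order('-posted')
          .fetch(20))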
Sharing an item requires creating O(N) SharedItems where N is the number of friends and followers the poster has. If this number is too large to handle in a single request, shard it out to a task queue or backend.
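If the fan-out is too large for one request, the deferred library is one way to push it to a task queue; a sketch, where get_follower_keys is a hypothetical helper:

import datetime
from google.appengine.ext import db, deferred

def fan_out(item_key, poster_key, posted):
    # One SharedItems entity per friend/follower of the poster, written in one batch.
    db.put([SharedItems(user=k, sharer=poster_key, item=item_key, posted=posted)
            for k in get_follower_keys(poster_key)])  # hypothetical helper

# Queue the fan-out instead of doing it in the user-facing request.
deferred.defer(fan_out, item.key(), poster.key(), datetime.datetime.utcnow())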
Property lists are a great way to get cheap/simple indexing in GAE.
But, as you have correctly identified, there are a few limitations:
The index size of the entire entity is limited (I think currently 5000 entries), and each property-list value requires an index entry, so in practice the property list must stay below 5000 values.
Serialisation of such a large property list is expensive!
Bringing back a 2MB entity is slow... and will cost CPU.
So if you expect a large property list, don't do it.
The alternative is to create a JOIN table that models the relationship:
class Friends(db.Model):
    user = db.ReferenceProperty(User, required=True, collection_name='friend_links')  # this user
    friend = db.ReferenceProperty(User, required=True, collection_name='friend_of')   # the befriended user ('from' is a reserved word in Python)
Just an entity with two keys.
This allows for simple querying to find all friends for a user:
SELECT * FROM Friends WHERE user = :me
and to find all users for whom I am the friend:
SELECT * FROM Friends WHERE friend = :me
Since each link entity just holds keys, you can do a bulk db.get(key_list) to fetch the actual friends' details.
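A sketch of that pattern (get_value_for_datastore reads the raw key off a ReferenceProperty without dereferencing it):

from google.appengine.ext import db

# Fetch the link entities, then batch-get the actual User entities in one RPC.
links = Friends.all().filter('user =', me).fetch(1000)
friend_keys = [Friends.friend.get_value_for_datastore(link) for link in links]
friends = db.get(friend_keys)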
