GAE Entity group / data modeling for consistency and performance

GAE Entity group / data modeling for consistency and performance - python

As a continuation of in this post, this is a bit of a capstone-style question to solidify my understanding of gae-datastore and get some critiques on my data modeling decisions. I'll be modifying he Jukebox example created by #Jimmy Kane to better reflect my real world case.
In the original setup,
imagine that you have a jukebox with queues per room let's say. And people are queueing songs to each queue of each jukebox.
J=Jukebox, Q=queue, S=Song
Jukebox
/ | \
Q1 Q2 Q3
/ | \ | \
S1 S2 S3 S4 S5
First, fill out the Song model as such:
Song(ndb.Model):
user_key = ndb.KeyProperty()
status = ndb.StringProperty()
datetime_added = ndb.DateTimeProperty()
My modification is to add a User that can CUD songs to any queue. In the frontend, users will visit a UI to see their songs in each of the queues, and make changes. In the backend, the application needs to know which songs are in each queue, play the right song off each queue and remove songs from queues once played.
In order for a User to be able to see its songs in queue I'm presuming each User would be a root entity and would need to store a list of Song keys
User(ndb.Model):
song_keys = ndb.KeyProperty(kind='Song', repeated=True)
Then, to retrieve the user's songs, the application would (presuming user_id is known)
user = User.get_by_id(user_id)
songs = ndb.get_multi(user.song_keys)
And, since gets are strongly consistent, the user would always see non-stale data
Then, when queue 1 is finished playing a song, the application could do something like:
current_song.status = "inactive"
current_song.put()
query=Song.query(ancestor=ndb.Key('Jukebox', '1', 'Queue', '1')).filter(Song.status=="active").order(Song.datetime_added)
next_song = query.get()
Am I right in thinking that the ancestor query ensures consistent representation of the preceding deactivation of the current song as well as any CUD from the Users?
The final step would be to update the User's song_keys list in a transaction
user = current_song.user_key.get()
user.song_keys.remove(current_song.key)
user.put()
Summary and some pros/cons
The consistency seems to be doing the right things in the rightplaces
if my understanding is right?
Should I be concerned about contention on the Jukebox entity group?
I wouldn't expect it to be a high throughput type of use case but my real-life scenario needs to scale with the number of users and there are probably a similar number of queues as there are users, maybe 2x - 5x more users than queues. If the whole group is limited to 1 write / sec and lots of users as well as each queue could be creating and updating songs, this could be a bottleneck
One solution could be to do away with the Jukebox root entity and have each Queue be its own root entity
User.song_keys could be long-ish, say 100 song.keys. This article advised "to avoid storing overly large lists of keys in a ListProperty". What's the concern here? Is this a db concept and moot with ndb's way of handling lists with the repeated=True property option?
Opinions on this approach or critiques on things I'm fundamentally misunderstanding?
Presumably, I could also alternatively, kind of just symmetrically flip
the data models and have entity groups that look like User ->
Song and store song_keys lists in the Queue model

I think you should reconsider how important is strong consistency for your use case. From what I can see it is not critical that all this entities have strong consistency. In my opinion, eventual consistency will work just fine. Most of the time you will see up to date data and only sometimes (read: really really rarely) you will see some stale data. Think about how critical is that you always get up to date data vs how much it penalizes your application. Entities that need strong consistency are not stored in the most efficient way in terms of number of reads per second.
Also if you look at the document Structuring Data for Strong Consistency, you will see that it mentions that you can't have more then 1 write per second when using that approach.
Also having entity groups effects data locality as per AppEngine Model Class docs.
If you also read the famous Google's doc on Google Spanner, section 2 you will see how they deal with entities which have same parent key. Essentially, they are put closer together. I assume Google might be using similar approach with AppEngine Datastore. At some point, according to this source Google might use Spanner for AppEngine Datastore in the future.
Another point, there is no cheaper of faster get then get by key. Having said this, if you can somehow avoid querying this could reduct the cost of running you application. Assuming that you're developing a web application you can store your song keys in a JSON/text object and then use Prospective Search API to get up to date results. This approach requires a bit more work and requires you to embrace eventual consistency model as the data might be slightly out of date by the time it reaches the client. Depending on your use case (this does not apply a small application and small user base obviously) the savings might out-weight the cost. When I say the cost I mean the fact that data might be slightly out of date.
In my experience, strong consistency is not a requirement for a large number of applications. The number of applications that can live with slightly stale data seems to outnumber the applications that cannot. Take YouTube for example, I don't really mind if I don't see all the videos immediately (as there's such a large number that I can't even know if I see all of them or not). When you design something like this, first ask yourself question, is it really necessary to provide up to date data or a bit stale data is good enough? Can the user even tell the difference? Up to date data is much more expensive then a little bit stale.

I've decided to take another approach, which is to rely on lists of song_keys at the Queues in addition to the Users. This way, I have strong consistency when dealing with Users and with Queues without needing to deal with the performance/consistency tradeoff that comes with entity groups. As a positive byproduct, getting keys leverages ndb autocaching so I anticipate a performance boost with enhanced code simplicity.
Still welcome any critiques...
UDPATE: A little more detail regarding autocaching. NDB automatically manages caching via memcache and an in-context cache. For my purposes, I'm mostly interested in the automatic memcache. By using predominantly get requests in favor of queries, NDB will check memcache first before reading from the datastore for all of those reads. I anticipate most requests to actually be fulfilled out of memcache rather than the datastore. I understand that I could manage all of that memcache activity myself and most likely in a way that would work decently with a query-focused approach so perhaps some wouldn't consider that a great rationale for the design decision. But the impact on code simplicity is pretty nice.

Related

Google app engine: using a property value as a key

I have a next entities:
class Player(ndb.Model):
player_id = ndb.IntegerProperty()
and
class TimeRecord(ndb.Model):
time = ndb.StringProperty()
So TimeRecord's instance is a child of certain instance of Player.
If I need to put a instance of TimeRecord with certain Player I am doing like this:
tr = TimeRecord(parent = ndb.Key("Player", Player.query(Player.player_id == int(certain_id)).get().key.integer_id()), time = value)
This query is expensive and sophisticated. Accordingly to doc
qry = Account.query(Account.userid == 42)
If you are sure that there was just one Account with that userid, you might prefer to use userid as a key. Account.get_by_id(...) is faster than Account.query(...).get().
As I understand I need to change structure of my datastore:
Use player_id as a key of Player and move TimeRecord (time) to property of Players. player_id is unique value.
class Player(ndb.Model):
time = ndb.StringProperty()
Q: Is that a right approach?
This is similar to mixing different levels of entities inheritance due to as I see every data should be a different entity. And inheritance implemented by ancestor keys.
Upd:
But in this case I can store just one TimeRecord value for a certain Player.
And I need a set of TimeRecords for a Player. Is a repeated property solution of this problem?

The redesign you're proposing is essentially, from the POV of a relational database user, a "de-normalization" -- which is almost a bad word in the relational field, but absolutely "normal" (ha ha) once you move into NoSQL.
If you know how things will be queried and updated, de-normalization improves performance (usually) and/or storage (sometimes) at the expense of some flexibility.
Do be aware of the trade-offs, though. Often, de-normalizing improves performance in querying/reading at the expense of extra burdens in updating -- that can be fine since typically reading is much more frequent than writing, but you need to know whether this is the case for your application.
Examining your specific use case, I see definite savings in storage (esp. if you can use a more specialized type for your time property, see https://cloud.google.com/appengine/docs/python/ndb/properties#Date_and_Time) and fewer interactions with the backend (thus better performance) on retrievals. It also simplifies your code (simplicity is good: fewer risks of bugs, easier to unit-test).
However, if saving new "time records" is a very frequent need for a player, the repeated property grows larger and larger (at some point this slows things down despite it still being a single interaction; at worst it would "bump its head" against a single entity's maximum size, which is one megabyte -- sure, that would take many tens of thousands of "time records" per player, but, not knowing your app at all, I can't tell whether that's a risk... only you can!-).
Queries can also be a problem, again entirely depending on what your app needs. I'm specifically thinking of inequality queries. Suppose you need all players with time records greater than, say, '20141215-10:00:00', and smaller than, say, '20141215-18:00:00'. Alas, an inequality query on a repeated property won't do that for you! That is, if you query for
ndb.AND(Player.time > '20141215-10:00:00',
Player.time < '20141215-18:00:00')
you'll get players with a time greater than the first constant and a time less than the second one -- not necessarily the same time! This means the query may return many more players than you wish it would, and you'll need to "filter" the resulting bunch of players in your app's code.
If you had an entity where time is not a repeated property (such as your original TimeRecord entity), then the query analogous to this one would return exactly the bunch of entities of interest (though if you then needed to fetch the players sporting those times, you'd then need another interaction with the storage back-end, typically an ndb.get_multi, so it's hard to predict performance effects without knowing much more about your app's operational parameters!).
That's what de-normalization usually boils down to: trade-offs between different aspect of "desirability" (simplicity, storage saving, fewer backend interactions, smaller amounts of data going to/from the backend -- and we're not even getting into atomic transactions and applicability of async techniques!-) -- trade-offs that can be made only with some deep understanding of an app's operational parameters.
Indeed it may be worth deploying two or more prototypes, each to a small set of users, to get actual data about how they perform (the new Cloud Monitoring offer can help with the "get actual data" part), before choosing a "definitive" (ha!) architecture -- despite the fact that migrating the data from the prototypes to the "definitive" schema will incur overhead-effort needs.
And if the app is an overnight success and suddenly you get tens of thousands of queries per second, rather than the orders-of-magnitude fewer you had planned for, the performance characteristics may just as suddenly change to the point the pain of re-architecting and migrating again may be warranted (a good problem to have, for sure, but still...).

In practice, how eventual is the "eventual consistency" in HRD?

I am in the process of migrating an application from Master/Slave to HRD. I would like to hear some comments from who already went through the migration.
I tried a simple example to just post a new entity without ancestor and redirecting to a page to list all entities from that model. I tried it several times and it was always consistent. Them I put 500 indexed properties and again, always consistent...
I was also worried about some claims of a limit of one 1 put() per entity group per second. I put() 30 entities with same ancestor (same HTTP request but put() one by one) and it was basically no difference from puting 30 entities without ancestor. (I am using NDB, could it be doing some kind of optimization?)
I tested this with an empty app without any traffic and I am wondering how much a real traffic would affect the "eventual consistency".
I am aware I can test "eventual consistency" on local development. My question is:
Do I really need to restructure my app to handle eventual consistency?
Or it would be acceptable to leave it the way it is because the eventual consistency is actually consistent in practice for 99%?

If you have a small app then your data probably live on the same part of the same disk and you have one instance. You probably won't notice eventual consistency. As your app grows, you notice it more. Usually it takes milliseconds to reach consistency, but I've seen cases where it takes an hour or more.
Generally, queries is where you notice it most. One way to reduce the impact is to query by keys only and then use ndb.get_multi() to load the entities. Fetching entities by keys ensures that you get the latest version of that entity. It doesn't guarantee that the keys list is strongly consistent, though. So you might get entities that don't match the query conditions, so loop through the entities and skip the ones that don't match.
From what I've noticed, the pain of eventual consistency grows gradually as your app grows. At some point you do need to take it seriously and update the critical areas of your code to handle it.

What's the worst case if you get inconsistent results?
Does a user see some unimportant info that's out of date? That's probably ok.
Will you miscalculate something important, like the price of something? Or the number of items in stock in a store? In that case, you would want to avoid that chance occurence.
From observation only, it seems like eventually consistent results show up more as your dataset gets larger, I suspect as your data is split across more tablets.
Also, if you're reading your entities back with get() requests by key/id, it'll always be consistent. Make sure you're doing a query to get eventually consistent results.

The replication speed is going to be primarily server-workload-dependent. Typically on an unloaded system the replication delay is going to be milliseconds.
But the idea of "eventually consistent" is that you need to write your app so that you don't rely on that; any replication delay needs to be allowable within the constraints of your application.

Python object hierarchy, and REST resources?

I am currently running into an "architectural" problem in my Python app (using Twisted) that uses a REST api, and I am looking for feedback.
Warning ! long post ahead!
Lets assume the following Object hiearchy:
class Device(object):
def __init__():
self._driver=Driver()
self._status=Status()
self._tasks=TaskManager()
def __getattr__(self, attr_name):
if hasattr(self._tasks, attr_name):
return getattr(self._tasks, attr_name)
else:
raise AttributeError(attr_name)
class Driver(object):
def __init__(self):
self._status=DriverStatus()
def connect(self):
"""some code here"""
def disconnect(self):
"""some code here"""
class DriverStatus(object):
def __init__(self):
self._isConnected=False
self._isPluggedIn=False
I also have a rather deep object hiearchy (the above elements are only a sub part of it) So, right now this gives me following resources, in the rest api (i know, rest isn't about url hierarchy, but media types, but this is for simplicity's sake):
/rest/environments
/rest/environments/{id}
/rest/environments/{id}/devices/
/rest/environments/{id}/devices/{deviceId}
/rest/environments/{id}/devices/{deviceId}/driver
/rest/environments/{id}/devices/{deviceId}/driver/driverstatus
I switched a few months back from a "dirty" soap type Api to REST, but I am becoming unsure about how to handle what seems like added complexity:
Proliferation of REST resources/media types : for example instead of having just a Device resource I now have all these resources:
Device
DeviceStatus
Driver
DriverStatus
While these all make sense from a Resfull point of view, is it normal to have a lot of sub resources that each map to a separate python class ?
Mapping a method rich application core to a Rest-Full api : in Rest resources should be nouns, not verbs : are there good rules /tips to inteligently define a set of resources from a set of methods ? (The most comprehensive example I found so far seems to be this article)
Api logic influencing application structure: should an application's API logic at least partially guide some of its internal logic, or is it good practice to apply separation of concerns ? Ie , should I have an intermediate layer of "resource" objects that have the job of communicating with the application core , but that do not map one to one to the core's classes ?
How would one correctly handle the following in a rest-full way : I need to be able to display a list of available driver types (ie class names, not Driver instance) in the client : would this mean creating yet another resource like "DriverTypes" ?
These are rather long winded questions, so thanks for your patience, and any pointers, feedback and criticism is more than welcome !
To S.Lott:
By "too fragmented resources" what i meant was, lots of different sub resources that basically still apply to the same server side entity
For The "connection" : So that would be a modified version of the "DriverStatus" resource then ? I consider the connection to be always existing, hence the use of "PUT" , but would that be bad thing considering "PUT" should be idempotent ?
You are right about "stopping coding and rethinking", that is the reason I asked all these questions and putting things down, on paper to get a better overview.
-The thing is, right now the basic "real world objects" as you put them make sense to me as rest resources /collections of resources, and they are correctly manipulated via POST, GET, UPDATE, DELETE , but I am having a hard time getting my head around the Rest approach for things that I do not instinctively view as "Resources".

Rule 1. REST is about objects. Not methods.
The REST "resources" have become too fragmented
False. Always false. REST resources are independent. They can't be "too" fragmented.
instead of having just a Device resource I now have all these resources:
Device DeviceStatus Driver DriverStatus
While these all make sense
from a [RESTful] point of view, is it normal to have a lot of sub
resources that each map to a separate python class ?
Actually, they don't make sense. Hence your question.
Device is a thing. /rest/environments/{id}/devices/{deviceId}
It has status. You should consider providing the status and the device information together as a single composite document that describes a device.
Just because your relational database is normalized does not mean your RESTful objects need to be precisely as normalized as your database. While it's simpler (and many frameworks make it very, very simple to do this) it may not be meaningful.
consider the connection to be always existing, hence the use of "PUT"
, but would that be bad thing considering "PUT" should be idempotent ?
Connections do not always exist. They may come and go.
While a relational database may have a many-to-many association table which you can UPDATE, that's a peculiar special case that doesn't really make much sense outside the world of DBA's.
The connection between two RESTful things is rarely a separate thing. It's an attribute of each of the RESTful things.
It's perfectly unclear what this "connection" thing is. You talk vaguely about it, but provide no details.
Lacking any usable facts, I'll guess that you're connecting devices to drivers and there's some kind of [Device]<-[Driver Status]->[Driver] relationship. The connection from device to driver can be a separate RESTful resource.
It can just as easily be an attribute of Device or Driver that does not actually have a separate, visible, RESTful resource.
[Again. Some frameworks like Django-Piston make it trivial to simple expose the underlying classes. This may not always be appropriate, however.]
are there good rules /tips to inteligently define a set of resources from a set of methods ?
Yes. Don't do it. Resources aren't methods. Pretty much that's that.
If you have a lot of methods -- outside CRUD -- then you may have a data model issue. You may have too few classes of things expressed in your relational model and too many stateful updates of things.
Stateful objects are not inherently evil, but they need to be examined critically. In some cases, a PUT to change status of an object perhaps should have been a POST to add to the history of an object. The "current" state is the last thing POSTed.
Also.
You don't have to trivialize each resource as a class of things. You can have resources which are collections. You can POST a fairly complex document to a composite (properly a Facade) "resource". That complex document can imply several CRUD operations in the database.
You're wandering away from simple RESTful. Your question remains intentionally murky. "method rich application core" doesn't mean much. Without concrete examples, it's impossible to imagine.
Api logic influencing application structure
If these are somehow different, you're probably creating needless, no-value complexity.
is it good practice to apply separation of concerns ?
Always. Why ask?
a lot of this seems to come from my confusion about how to map a rather method rich api to a Rest-Full one , where resources should be nouns, not verbs : so when is it wise to consider an element a rest "resource"?
A resource is defined by your problem domain. It's usually something tangible. The methods (as in "method-rich API" are usually irrelevant. They're CRUD (Create, Retrieve, Update, Delete) operations. If you have something that's not essentially CRUD, you have to STOP coding. STOP writing code, and rethink the API to make it CRUD-like.
CRUD - Create-Retrieve-Update-Delete maps to REST's POST-GET-PUT-DELETE. If you can't recast your problem domain into those terms, stop coding. Stop coding until you get to CRUD rules.
i need to be able to display a list of available driver types (ie class names, not Driver instance) in the client : would this mean creating yet another resource like "DriverTypes" ?
Correct. They're already part of your problem domain. You already have this class defined. You're just making it available through REST.
Here's the point. The problem domain has real-world objects. You have class definitions. They're tangible things. REST transfers the state of those tangible things.
Your software may have intangible things like "associations" or "links" or "connections" other junk that's part of the software solution. This junk doesn't matter very much. It's implementation detail. Not real-world things.
An "association" is always visible from both of the two real-world RESTful resources. One resource may have an foreign-key like reference that allows the client to do a RESTful fetch of another, related object. Or a resource may have a collection of other, related objects, and a single GET retrieves an object and a collection of related objects.
Either way, the real-world RESTful resources are what's available. The relationship is merely implied. Even if it's a physical many-to-many database table -- that doesn't mean it must be exposed. [Again. Some frameworks make it trivially easy to expose everything. This isn't always good.]

You can represent the path portion /rest with a Site object, but environments in the path must be a Resource. From there you have to handle the hierarchy yourself in the render_* methods of environments. The request object you get will have a postpath attribute that gives you the remainder of the path (i.e. after /rest/environments). You'll have to parse out the id, detect whether or not devices is given in the path, and if so pass the remainder of the path (and the request) down to your devices collection. Unfortunately, Twisted will not handle this decision for you.

Google App Engine counters

For all my data in the GAE Datastore I have a model for keeping track of counters/total number of records (since we can't use traditional SUM queries). I want to know the most efficient way of incrementing these global count values whenever I insert/delete a record. This is what I'm currently doing:
counter = DBCounter.all().fetch(1)
dbc = DBCounter(totalTopics=counter[0].totalTopics+1)
dbc.put()
But this seems quite sloppy to me. Any thoughts on a better way to do this?

There are a few issues with your approach:
It may under-count since you don't use a transaction to atomically update the counter.
It is inefficient:
Contention may become a problem if you need to update this counter frequently. Since you only have one counter, it won't scale well. Datastore entities can only be written at a rate of at most 5 times per second.
You're writing to the datastore twice every time you insert a record. If you end up using transactions to fix the above problem, then you'll be making two round-trips to the datastore every time you insert the record (once to insert, and once to update the counter). You might be able to use an approach which avoids this extra round-trip to the datastore.
Here are some alternate approaches (from least accurate [and fastest] to most accurate [and slowest]):
If you only need a rough count of the number of entities of particular kind in the datastore, then you can use the Stats API. The counts you retrieve are not constantly updated, however.
If you need more granularity but are okay with a small possibility of occasionally under-counting, then you could use a memcache-enhanced counter. There are several good implementations discussed in this question. In particular, see the code in the comments in this recipe.
If you really want to avoid undercounting, then you should consider a sharded datastore counter. This will eliminate the contention issue from above.

If you need to keep scalability while counting, you should look into Joe Gregorio's article on sharding counters and DocSavage's implementation of the idea.
AppEngineFan's excellent blog also has info on scalable non-sharded counters, see this one which uses task queues and points to the previous article on using cron jobs instead.

Capturing Implicit Signals of Interest in Django

To set the background: I'm interested in:
Capturing implicit signals of interest in books as users browse around a site. The site is written in django (python) using mysql, memcached, ngnix, and apache
Let's say, for instance, my site sells books. As a user browses around my site I'd like to keep track of which books they've viewed, and how many times they've viewed them.
Not that I'd store the data this way, but ideally I could have on-the-fly access to a structure like:
{user_id : {book_id: number_of_views, book_id_2: number_of_views}}
I realize there are a few approaches here:
Some flat-file log
Writing an object to a database every time
Writing to an object in memcached
I don't really know the performance implications, but I'd rather not be writing to a database on every single page view, and the lag writing to a log and computing the structure later seems not quick enough to give good recommendations on-the-fly as you use the site, and the memcached appraoch seems fine, but there's a cost in keeping this obj in memory: you might lose it, and it never gets written somewhere 'permanent'.
What approach would you suggest? (doesn't have to be one of the above) Thanks!

If this data is not an unimportant statistic that might or might not be available I'd suggest taking the simple approach and using a model. It will surely hit the database everytime.
Unless you are absolutely positively sure these queries are actually degrading overall experience there is no need to worry about it. Even if you optimize this one, there's a good chance other unexpected queries are wasting more CPU time. I assume you wouldn't be asking this question if you were testing all other queries. So why risk premature optimization on this one?
An advantage of the model approach would be having an API in place. When you have tested and decided to optimize you can keep this API and change the underlying model with something else (which will most probably be more complex than a model).
I'd definitely go with a model first and see how it performs. (and also how other parts of the project perform)

What approach would you suggest? (doesn't have to be one of the above) Thanks!
hmmmm ...this like been in a four walled room with only one door and saying i want to get out of room but not through the only door...
There was an article i was reading sometime back (can't get the link now) that says memcache can handle huge (facebook uses it) sets of data in memory with very little degradation in performance...my advice is you will need to explore more on memcache, i think it will do the trick.

Either a document datastore (mongo/couchdb), or a persistent key value store (tokyodb, memcachedb etc) may be explored.
No definite recommendations from me as the final solution depends on multiple factors - load, your willingness to learn/deploy a new technology, size of the data...

Seems to me that one approach could be to use memcached to keep the counter, but have a cron running regularly to store the value from memcached to the db or disk. That way you'd get all the performance of memcached, but in the case of a crash you wouldn't lose more than a couple of minutes' data.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.