GAE/P: Dealing with eventual consistency - python

In may app, I have the following process:
Get a very long list of people
Create an entity for each person
Send an email to each person (step 2 must be completed before step 3 starts)
Because the list of people is very large, I don't want to put them in the same entity group.
In doing step 3, I can query the list of people like this:
Person.all()
Because of eventual consistency, I might miss some people in step 3. What is a good way to ensure that I am not missing anyone in step 3?
Is there a better solution than this?:
while Person.all().count() < N:
pass
for p in Person.all()
# do whatever
EDIT:
Another possible solution came to mind. I could create a linked list of the people. I can store a link to the first one, he can link to the second one and so one. It seems that the performance would be poor however, because you'd be doing each get separately and wouldn't have the efficiencies of a query.

UPDATE: I reread your post and saw that you don't want to put them all in the same entity group. I'm not sure how to guarantee strong consistency without doing so. You might want to restructure your data so that you don't have to put them in the same entity group, but in several. Perhaps depending on some aspect of a group of Person entities? (e.g., mailing list they are on, type of email being sent, etc.) Does each Person only contain a name and an email address, or are there other properties involved?
Google suggests a a few other alternatives:
If your application is likely to encounter heavier write usage, you may need to consider using other means: for example, you might put recent posts in a memcache with an expiration and display a mix of recent posts from the memcache and the Datastore, or you might cache them in a cookie, put some state in the URL, or something else entirely. The goal is to find a caching solution that provides the data for the current user for the period of time in which the user is posting to your application. Remember, if you do a get, a put, or any operation within a transaction, you will always see the most recently written data.
So it looks like you may want to investigate those possibilities, although I'm not sure how well they would translate to what your app needs.
ORIGINAL POST: Use ancestor queries.
From Google's "Structuring Data for Strong Consistency":
To obtain strongly consistent query results, you need to use an ancestor query limiting the results to a single entity group. This works because entity groups are a unit of consistency as well as transactionality. All data operations are applied to the entire group; an ancestor query won't return its results until the entire entity group is up to date. If your application relies on strongly consistent results for certain queries, you may need to take this into consideration when designing your data model. This page discusses best practices for structuring your data to support strong consistency.
So when you create a Person entity, set a parent for it. I believe you could even just have a specific entity be the "parent" of all the others, and it should give you strong consistency. (Although I like to structure my data a bit with ancestors anyway.)
# Gives you the ancestor key
def ancestor_key(kind, id_or_name):
return db.Key.from_path(kind, id_or_name)
# Kind is the db model your using (should be 'Person' in this case) and
# id_or_name should be the key id or name for the parent
new_person = Person(your_params, parent=ancestor_key('Kind', id_or_name)
You could even do queries at that point for all the entities with the same parent, which is nice. But that should help you get more consistent results regardless.

Related

AppEngine model structure for user/follower relations

I have a users who have "followers". I need to be able to navigate up and down the tree of users/followers. I'm eventually going hit AppEngine's 1mb limit on entity entries if I use ancestor relations if a user has many followers.
What's the best way to structure this data on AppEngine?
You cannot use ancestor relations for a simple reason that your use case allows circular references (I follow you, you follow me).
The solution depends on your expected usage patterns. You can choose between two options:
(A) In each suer entity store a list of IDs of other users that this user is following.
(B) Create a separate entity that has two properties: "User" and"Follower". Every entity will represent a single "connection" between users.
While the first option seems simpler, you may run into exploding indexes problem. Besides, it may turn out to be a more expensive solution as each change in user relationships will require an overwrite of a user entity with updates to all of its other indexes. The second solution does not have these drawbacks, but may require a little extra code.

Strongly consistent queries for root entities in GAE?

I'd like some advice on the best way to do a strongly consistent read/write in Google App Engine.
My data is stored in a class like this.
class UserGroupData(ndb.Model):
users_in_group = ndb.StringProperty(repeated=True)
data = ndb.StringProperty(repeated=True)
I want to write a safe update method for this data. As far as I understand, I need to avoid eventually consistent reads here, because they risk data loss. For example, the following code is unsafe because it uses a vanilla query which is eventually consistent:
def update_data(user_id, additional_data):
entity = UserGroupData.query(UserGroupData.users_in_group==user_id).get()
entity.data.append(additional_data)
entity.put()
If the entity returned by the query is stale, data is lost.
In order to achieve strong consistency, it seems I have a couple of different options. I'd like to know which option is best:
Option 1:
Use get_by_id(), which is always strongly consistent. However, there doesn't seem to be a neat way to do this here. There isn't a clean way to derive the key for UserGroupData directly from a user_id, because the relationship is many-to-one. It also seems kind of brittle and risky to require my external clients to store and send the key for UserGroupData.
Option 2:
Place my entities in an ancestor group, and perform an ancestor query. Something like:
def update_data(user_id, additional_data):
entity = UserGroupData.query(UserGroupData.users_in_group==user_id,
ancestor=ancestor_for_all_ugd_entities()).get()
entity.data.append(additional_data)
entity.put()
I think this should work, but putting all UserGroupData entities into a single ancestor group seems like an extreme thing to do. It results in writes being limited to ~1/sec. This seems like the wrong approach, since each UserGroupData is actually logically independent.
Really what I'd like to do is perform a strongly consistent query for a root entity. Is there some way to do this? I noticed a suggestion in another answer to essentially shard the ancestor group. Is this the best that can be done?
Option 3:
A third option is to do a keys_only query followed by get_by_id(), like so:
def update_data(user_id, additional_data):
entity_key = UserGroupData.query(UserGroupData.users_in_group==user_id,
).get(keys_only=True)
entity = entity_key.get()
entity.data.append(additional_data)
entity.put()
As far as I can see this method is safe from data loss, since my keys are not changing and the get() gives strongly consistent results. However, I haven't seen this approach mentioned anywhere. Is this a reasonable thing to do? Does it have any downsides I need to understand?
I think you are also conflating the issue of inconsistent queries with safe updates of the data.
A query like the one in your example UserGroupData.query(UserGroupData.users_in_group==user_id).get() will always only return one entity, if the user_id is in the group.
If it has only just been added and the index is not up to date then you won't get a record and therefore you won't update the record.
Any update irrespective of the method of fetching the entity should be performed inside a transaction ensuring update consistency.
As to ancestors improving the consistency of the query, it's not obvious if you plan to have multiple UserGroupData entities. In which case why are you doing a get().
So option 3, is probably your best bet, do the keys only query, then inside a transaction do the Key.get() and update. Remember cross group transactions are limited 5 entity groups.
Given this approach if the index the query is based is out of date then 1 of 3 things can happen,
the record you want isn't found because the newly added userid is not reflected in the index.
the record you want is found, the get() will fetch it consistently
the record you want is found, but the userid has actually been removed and the index is out of date. The get() will retrieve the index consistently and the userid is not present.
You code can then decide what course of action.
What is the use case for querying all UserGroupData entities that a particular user is a member of that would require updates ?

How to transfer multiple entities for NDB Model in Google App Engine for python?

I have an NDB model. Once the data in the model becomes stale, I want to remove stale data items from searches and updates. I could have deleted them, which is explained in this SO post, if not for the need to analyze old data later.
I see two choices
adding a boolean status field, and simply mark entities deleted
move entities to a different model
My understanding of the trade off between these two options
mark-deleted is faster
mark-deleted is more error prone: having extra column would require modifying all the queries to exclude entities that are marked deleted. That will increase complexity and probability of bugs.
Question:
Can move-entities option be made fast enough to be comparable to mark-deleted?
Any sample code as to how to move entities between models efficiently?
Update: 2014-05-14, I decided for the time being to use mark-deleted. I figure there is an additional benefit of fewer RPCs.
Related:
How to delete all entities for NDB Model in Google App Engine for python?
You can use a combination, of the solutions you propose although in my head I think its an over engineering.
1) In first place, write a task queue that will update all of your entities with your new field is_deleted with a default value False, this will prevent all the previous entities to return an error when you ask them if they are deleted.
2) Write your queries in a model level, so you don't have to alter them any time you make a change in your model, but only pass the extra parameter you want to filter on when you make the relevant query. You can get an idea from the model of the bootstrap project gae-init. You can query them with is_deleted = False.
3) BigTable's performance will not be affected if you are querying 10 entities or 10 M entities, but if you want to move the deleted ones in an new Entity model you can try to create a crop job so in the end of the day or something copy them somewhere else and remove the original ones. Don't forget that will use your quota and you mind end up paying literally for the clean up.
Keep in mind also that if there are any dependencies on the entities you will move, you will have to update them also. So in my opinion its better to leave them flagged, and index your flag.

How to use High Replication Datastore

Okay, I have watched the video and read the articles in the App Engine documentation (including Using the High Replication Datastore). However I am still completely confused on the practical usage of it. I understand the benefits (from the video) and they sound great. But what I am lacking is a few practical examples. There are plenty of master/slave examples on the web, but very little illustrating (with proper documentation) the high replication datastore. The guestbook code example used in the Using the High Replication Datastore article illustrates the ancestor key by adding a new functionality that the previous guestbook code example does not have (seems you can change guestbook). This just adds to the confusion.
I often use djangoforms on GAE and I was wondering if someone can help me translate all these queries into high replication datastore compatible queries (let's forget for a moment the discussion that not all queries necessarily need to be high replication datastore compatible queries and focus on the example itself).
UPDATE: with high replication datastore compatible queries I refer to queries that always return the latest data and not potential stale data. Using entity groups seems to be the way to go here but as mentioned before, I don't have many practical code examples of how to do this, so that is what I am looking for!
So the queries in this article are:
The main recurring query in this article is:
query = db.GqlQuery("SELECT * FROM Item ORDER BY name")
which we will translate to:
query = Item.all().order('name') // datastore request
validating the form happens like:
data = ItemForm(data=self.request.POST)
if data.is_valid():
# Save the data, and redirect to the view page
entity = data.save(commit=False)
entity.added_by = users.get_current_user()
entity.put() // datastore request
and getting the latest entry from the datastore for populating a form happens like:
id = int(self.request.get('id'))
item = Item.get(db.Key.from_path('Item', id)) // datastore request
data = ItemForm(data=self.request.POST, instance=item)
So what do I/we need to do to make all these datastore requests compatible with the high replication datastore?
One last thing that is also not clear to me. Using ancestor keys, does this have any impact on the model in datastore. For example, in the guestbook code example they use:
def guestbook_key(guestbook_name=None):
return db.Key.from_path('Guestbook', guestbook_name or 'default_guestbook')
However 'Guestbook' does not exist in the model, so how can you use 'db.Key.from_path' on this and why would this work? Does this change how data is stored in the datastore which I need to keep into account when retrieving the data (e.g. does it add another field I should exclude from showing when using djangoforms)?
Like I said before, this is confusing me a lot and your help is greatly appreciated!
I'm not sure why you think you need to change your queries at all. The documentation that you link to clearly states:
The back end changes, but the datastore API does not change at all. You'll use the same programming interfaces no matter which datastore you're using.
The point of that page is just to say that queries may be out of sync if you don't use entity groups. Your final code snippet is just an example of that - the string 'Guestbook' is exactly an ancestor key. I don't understand why you think it needs to exist in the model. Once again, this is unchanged from the non-HR datastore - it has always been the case that keys are built up from paths, which can consist of arbitrary strings. You probably need to reread the documentation on entity groups and keys.
The changes to use the HRD are not in how queries are made, but in what guarantees are made about what data you get back. The example you give:
query = db.GqlQuery("SELECT * FROM Item ORDER BY name")
will work in the HRD as well. The catch (basically) is that this kind of query (using either this syntax, or the Item.all() form) can return objects slightly out-of-date. This is probably not a big deal with the guestbook.
Note that if you're getting an object by key directly, it will never be out-of-date. It's only for queries that you can see this issue. You can avoid this problem with queries by placing all the entities that need to be consistent in a single entity group. Note that this limits the rate at which you can write to the entity group.
In answer to your follow-up question, "Guestbook" is the name of the entity.

Reverse Search Best Practices?

I'm making an app that has a need for reverse searches. By this, I mean that users of the app will enter search parameters and save them; then, when any new objects get entered onto the system, if they match the existing search parameters that a user has saved, a notification will be sent, etc.
I am having a hard time finding solutions for this type of problem.
I am using Django and thinking of building the searches and pickling them using Q objects as outlined here: http://www.djangozen.com/blog/the-power-of-q
The way I see it, when a new object is entered into the database, I will have to load every single saved query from the db and somehow run it against this one new object to see if it would match that search query... This doesn't seem ideal - has anyone tackled such a problem before?
At the database level, many databases offer 'triggers'.
Another approach is to have timed jobs that periodically fetch all items from the database that have a last-modified date since the last run; then these get filtered and alerts issued. You can perhaps put some of the filtering into the query statement in the database. However, this is a bit trickier if notifications need to be sent if items get deleted.
You can also put triggers manually into the code that submits data to the database, which is perhaps more flexible and certainly doesn't rely on specific features of the database.
A nice way for the triggers and the alerts to communicate is through message queues - queues such as RabbitMQ and other AMQP implementations will scale with your site.
The amount of effort you use to solve this problem is directly related to the number of stored queries you are dealing with.
Over 20 years ago we handled stored queries by treating them as minidocs and indexing them based on all of the must have and may have terms. A new doc's term list was used as a sort of query against this "database of queries" and that built a list of possibly interesting searches to run, and then only those searches were run against the new docs. This may sound convoluted, but when there are more than a few stored queries (say anywhere from 10,000 to 1,000,000 or more) and you have a complex query language that supports a hybrid of Boolean and similarity-based searching, it substantially reduced the number we had to execute as full-on queries -- often no more that 10 or 15 queries.
One thing that helped was that we were in control of the horizontal and the vertical of the whole thing. We used our query parser to build a parse tree and that was used to build the list of must/may have terms we indexed the query under. We warned the customer away from using certain types of wildcards in the stored queries because it could cause an explosion in the number of queries selected.
Update for comment:
Short answer: I don't know for sure.
Longer answer: We were dealing with a custom built text search engine and part of it's query syntax allowed slicing the doc collection in certain ways very efficiently, with special emphasis on date_added. We played a lot of games because we were ingesting 4-10,000,000 new docs a day and running them against up to 1,000,000+ stored queries on a DEC Alphas with 64MB of main memory. (This was in the late 80's/early 90's.)
I'm guessing that filtering on something equivalent to date_added could be done used in combination the date of the last time you ran your queries, or maybe the highest id at last query run time. If you need to re-run the queries against a modified record you could use its id as part of the query.
For me to get any more specific, you're going to have to get a lot more specific about exactly what problem you are trying to solve and the scale of the solution you are trying accomplishing.
If you stored the type(s) of object(s) involved in each stored search as a generic relation, you could add a post-save signal to all involved objects. When the signal fires, it looks up only the searches that involve its object type and runs those. That probably will still run into scaling issues if you have a ton of writes to the db and a lot of saved searches, but it would be a straightforward Django approach.

Categories

Resources