Okay, I have watched the video and read the articles in the App Engine documentation (including Using the High Replication Datastore). However I am still completely confused on the practical usage of it. I understand the benefits (from the video) and they sound great. But what I am lacking is a few practical examples. There are plenty of master/slave examples on the web, but very little illustrating (with proper documentation) the high replication datastore. The guestbook code example used in the Using the High Replication Datastore article illustrates the ancestor key by adding a new functionality that the previous guestbook code example does not have (seems you can change guestbook). This just adds to the confusion.
I often use djangoforms on GAE and I was wondering if someone can help me translate all these queries into high replication datastore compatible queries (let's forget for a moment the discussion that not all queries necessarily need to be high replication datastore compatible queries and focus on the example itself).
UPDATE: with high replication datastore compatible queries I refer to queries that always return the latest data and not potential stale data. Using entity groups seems to be the way to go here but as mentioned before, I don't have many practical code examples of how to do this, so that is what I am looking for!
So the queries in this article are:
The main recurring query in this article is:
query = db.GqlQuery("SELECT * FROM Item ORDER BY name")
which we will translate to:
query = Item.all().order('name') // datastore request
validating the form happens like:
data = ItemForm(data=self.request.POST)
if data.is_valid():
# Save the data, and redirect to the view page
entity = data.save(commit=False)
entity.added_by = users.get_current_user()
entity.put() // datastore request
and getting the latest entry from the datastore for populating a form happens like:
id = int(self.request.get('id'))
item = Item.get(db.Key.from_path('Item', id)) // datastore request
data = ItemForm(data=self.request.POST, instance=item)
So what do I/we need to do to make all these datastore requests compatible with the high replication datastore?
One last thing that is also not clear to me. Using ancestor keys, does this have any impact on the model in datastore. For example, in the guestbook code example they use:
def guestbook_key(guestbook_name=None):
return db.Key.from_path('Guestbook', guestbook_name or 'default_guestbook')
However 'Guestbook' does not exist in the model, so how can you use 'db.Key.from_path' on this and why would this work? Does this change how data is stored in the datastore which I need to keep into account when retrieving the data (e.g. does it add another field I should exclude from showing when using djangoforms)?
Like I said before, this is confusing me a lot and your help is greatly appreciated!
I'm not sure why you think you need to change your queries at all. The documentation that you link to clearly states:
The back end changes, but the datastore API does not change at all. You'll use the same programming interfaces no matter which datastore you're using.
The point of that page is just to say that queries may be out of sync if you don't use entity groups. Your final code snippet is just an example of that - the string 'Guestbook' is exactly an ancestor key. I don't understand why you think it needs to exist in the model. Once again, this is unchanged from the non-HR datastore - it has always been the case that keys are built up from paths, which can consist of arbitrary strings. You probably need to reread the documentation on entity groups and keys.
The changes to use the HRD are not in how queries are made, but in what guarantees are made about what data you get back. The example you give:
query = db.GqlQuery("SELECT * FROM Item ORDER BY name")
will work in the HRD as well. The catch (basically) is that this kind of query (using either this syntax, or the Item.all() form) can return objects slightly out-of-date. This is probably not a big deal with the guestbook.
Note that if you're getting an object by key directly, it will never be out-of-date. It's only for queries that you can see this issue. You can avoid this problem with queries by placing all the entities that need to be consistent in a single entity group. Note that this limits the rate at which you can write to the entity group.
In answer to your follow-up question, "Guestbook" is the name of the entity.
Related
I am new to GAE. I started working on NDB data store service. But the Parent key structure of it really confusing me. I also watched some tutorials on YouTube but they just explain its documentation.
I also followed the documentation but still it is not clear to me. It is the link which i explored.
Google App Engine NDB Data Store Service
NDB datastore is a distributed system. Absolute data consistency is very hard for distributed systems in general. By default NDB is eventually consistent. This means that by default:
If you add a record it may not appear immediately in a query
You cannot do transactions across multiple records by default
If you have more strict requirements you can define groups of entities by giving them the same parent key and specifying it in queries. You are then able to get consistent behaviour within these groups.
It is often better to not to use parent keys at all since they come with a heavy performance penalty. Most of the time apps do not need parent keys.
Quote from Entities, Properties, and Keys
There is a write throughput limit of about one transaction per second within a single entity group. This limitation exists because Datastore performs masterless, synchronous replication of each entity group over a wide geographic area to provide high reliability and fault tolerance.
Is there a difference in the results I can expect from this code:
query = MyModel.all(keys_only=True).filter('myFlag', True)
keys = list(query)
models = db.get(keys)
versus this code:
query = MyModel.all().filter('myFlag', True)
models = list(query)
i.e, will models be the same in both?
If not, why not? I had thought that eventual consistency is used to describe how indices for models take a while to update and can therefore be inconsistent with the most recently written data.
But I recently experienced a case where I was actually getting stale data from a query like the second one, where model.myFlag was True for the models retrieved via query but False when I actually got the model via key.
So in that case, where is the data for myFlag coming from?
Is it that getting an entity via key ensures replication across the datastore nodes and returns the latest data, whereas getting it via query simply retrieves the data from the nearest datastore node?
Edit:
I read this article, and assuming the Cloud Datastore works the same way as the Appengine Datastore, the answer to my question is yes, entities returned from queries may have stale values.
https://cloud.google.com/developers/articles/balancing-strong-and-eventual-consistency-with-google-cloud-datastore#h.tf76fya5nqk8
Yes, as you mentioned queries may return stale values. When doing a query, the datastore chooses performance over consistency.
More in-depth: For an entity group, each node has a log of writes which have not been applied yet. When you execute a read or an ancestor query, entity groups that are involved first have their logs applied. However when you execute a normal query the results could be from any entity group so the entity groups are not caught up. Be careful about using the first code example though, the indexes that are used to actually find those entities may not be up-to-date. So it is very possible to not get all entities with myFlag = True. If you are interested, I would recommend reading the Megastore paper.
I have a query in my live app that has gone "odd"...
Running 1.8.4 SDK... 1.8.5 live instance using Python 2.7
Measurement is an NDB model... with a string property called status and a key property called asset....
(Deep in my handler code.... )
cursor=None
limit=10
asset_key = <a key to an actual asset>
qry = Measurement.query(
Measurement.status=='PENDING',
Measurement.asset=asset_key)
results, cursor, more = qry.fetch_page(page_size=limit, start_cursor=cursor)
Now the weird thing is if I run this sometimes I get 4 items and sometimes only 1. (the right answer is 4)....
The dump of the query is exactly the same ... cursor is set to None... limit is always the same....same handler...same query and no new records in between each query. Fresh instance (eg 1st time + no other users)
Each query is only separated by seconds yet results a different.
Am I missing something here... has anyone else experienced this? Is this some sort of corrupt index? (It is a relatively large "table" with 482,911 items) Is NDB caching a cursor variable???
Very very odd.
Queries do not look up values in any cache. However, query results are written back to the in-context cache if the cache policy says so (as per the docs). https://developers.google.com/appengine/docs/python/ndb/cache#incontext
Perhaps review the caching policy for the entity in question. However, from your snippet I'm unsure if your query is strongly consistent. That is more likely the cause of this issue: https://developers.google.com/appengine/docs/python/datastore/structuring_for_strong_consistency
I have this code using Python with --high_replication --use_sqlite:
def delete(self, id):
product = Product.get_by_id(long(id))
if product is None:
self.session.add_flash('Product could not be found', level='error')
self.redirect_to('products')
else:
product.key.delete()
self.session.add_flash('Product is deleted')
self.redirect_to('products')
After the delete I redirect to the 'products' page which is basically a page querying all Products and displaying them.
The only thing I found out is that it is displaying the deleted record as well.
When I refresh the 'products' page then the record is gone.
Are others facing this as well and is there something I can do?
Edit1:
I'm seeing this behaviour only local btw, on production infrastructure this is not the case.
I solved this in the past for the Java sdk using the following jvm arg:
-Ddatastore.default_high_rep_job_policy_unapplied_job_pct=20
Does Python sdk has something similar to simulate the amount of eventual consistency you want your application to see locally?
See https://developers.google.com/appengine/docs/java/tools/devserver
What you're seeing is the eventual consistency behavior of the HRD datastore, which the devserver simulates.
https://developers.google.com/appengine/docs/python/datastore/queries#Data_Consistency
In an eventually consistent query, the indexes used to gather the results are also accessed with eventual consistency. Consequently, such queries may sometimes return entities that no longer match the original query criteria, while strongly consistent queries are always transactionally consistent.
In may app, I have the following process:
Get a very long list of people
Create an entity for each person
Send an email to each person (step 2 must be completed before step 3 starts)
Because the list of people is very large, I don't want to put them in the same entity group.
In doing step 3, I can query the list of people like this:
Person.all()
Because of eventual consistency, I might miss some people in step 3. What is a good way to ensure that I am not missing anyone in step 3?
Is there a better solution than this?:
while Person.all().count() < N:
pass
for p in Person.all()
# do whatever
EDIT:
Another possible solution came to mind. I could create a linked list of the people. I can store a link to the first one, he can link to the second one and so one. It seems that the performance would be poor however, because you'd be doing each get separately and wouldn't have the efficiencies of a query.
UPDATE: I reread your post and saw that you don't want to put them all in the same entity group. I'm not sure how to guarantee strong consistency without doing so. You might want to restructure your data so that you don't have to put them in the same entity group, but in several. Perhaps depending on some aspect of a group of Person entities? (e.g., mailing list they are on, type of email being sent, etc.) Does each Person only contain a name and an email address, or are there other properties involved?
Google suggests a a few other alternatives:
If your application is likely to encounter heavier write usage, you may need to consider using other means: for example, you might put recent posts in a memcache with an expiration and display a mix of recent posts from the memcache and the Datastore, or you might cache them in a cookie, put some state in the URL, or something else entirely. The goal is to find a caching solution that provides the data for the current user for the period of time in which the user is posting to your application. Remember, if you do a get, a put, or any operation within a transaction, you will always see the most recently written data.
So it looks like you may want to investigate those possibilities, although I'm not sure how well they would translate to what your app needs.
ORIGINAL POST: Use ancestor queries.
From Google's "Structuring Data for Strong Consistency":
To obtain strongly consistent query results, you need to use an ancestor query limiting the results to a single entity group. This works because entity groups are a unit of consistency as well as transactionality. All data operations are applied to the entire group; an ancestor query won't return its results until the entire entity group is up to date. If your application relies on strongly consistent results for certain queries, you may need to take this into consideration when designing your data model. This page discusses best practices for structuring your data to support strong consistency.
So when you create a Person entity, set a parent for it. I believe you could even just have a specific entity be the "parent" of all the others, and it should give you strong consistency. (Although I like to structure my data a bit with ancestors anyway.)
# Gives you the ancestor key
def ancestor_key(kind, id_or_name):
return db.Key.from_path(kind, id_or_name)
# Kind is the db model your using (should be 'Person' in this case) and
# id_or_name should be the key id or name for the parent
new_person = Person(your_params, parent=ancestor_key('Kind', id_or_name)
You could even do queries at that point for all the entities with the same parent, which is nice. But that should help you get more consistent results regardless.