High Replication delete delay on local dev server - Python

I have this code using Python with --high_replication --use_sqlite:
def delete(self, id):
    product = Product.get_by_id(long(id))
    if product is None:
        self.session.add_flash('Product could not be found', level='error')
        self.redirect_to('products')
    else:
        product.key.delete()
        self.session.add_flash('Product is deleted')
        self.redirect_to('products')
After the delete I redirect to the 'products' page, which is basically a page that queries all Products and displays them.
The problem is that the page still displays the deleted record.
When I refresh the 'products' page, the record is gone.
Are others seeing this as well, and is there something I can do about it?
Edit 1:
I'm seeing this behaviour only locally, by the way; on production infrastructure this is not the case.
I solved this in the past for the Java SDK using the following JVM arg:
-Ddatastore.default_high_rep_job_policy_unapplied_job_pct=20
Does the Python SDK have something similar to simulate the amount of eventual consistency you want your application to see locally?
See https://developers.google.com/appengine/docs/java/tools/devserver

What you're seeing is the eventual consistency behavior of the HRD datastore, which the devserver simulates.
https://developers.google.com/appengine/docs/python/datastore/queries#Data_Consistency
In an eventually consistent query, the indexes used to gather the results are also accessed with eventual consistency. Consequently, such queries may sometimes return entities that no longer match the original query criteria, while strongly consistent queries are always transactionally consistent.
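The Python SDK does have equivalents. For unit tests, the testbed exposes a consistency policy comparable to the Java flag above; newer versions of dev_appserver.py also accept a --datastore_consistency_policy flag (consistent, time, or random), depending on your SDK version. A minimal testbed sketch (note that probability is the chance a write is applied immediately, so the Java example's 20% unapplied corresponds to probability=0.8):

from google.appengine.ext import testbed
from google.appengine.datastore import datastore_stub_util

tb = testbed.Testbed()
tb.activate()
# Simulate HRD eventual consistency: each write has an 80% chance of
# being applied immediately, i.e. roughly 20% of jobs stay unapplied.
policy = datastore_stub_util.PseudoRandomHRConsistencyPolicy(probability=0.8)
tb.init_datastore_v3_stub(consistency_policy=policy)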

Related

Wait for datastore changes before redirecting

Very similar to this question, except that the answer there is not suitable.
I populate a table from a datastore query; each row has a link allowing the user to delete it. Clicking the link goes to a URL that deletes the row from the datastore and then redirects back to the table.
More often than not, the change isn't shown in the table until it is reloaded again.
An easy solution is to redirect to an intermediate page that uses a JavaScript redirect to add a delay of a couple of seconds. Another alternative is to send details back to the page, like action=delete&key=###, and then make sure that item is omitted from the table (see the sketch below). That's a pain, though.
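A sketch of that second workaround, assuming a db model named Item and a hypothetical deleted query parameter carrying the just-deleted key:

# Hide the just-deleted row while the index catches up.
deleted_key = self.request.get('deleted')
items = [item for item in Item.all()
         if str(item.key()) != deleted_key]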
The answer is with ancestor queries.
https://cloud.google.com/appengine/docs/python/datastore/queries#Python_Ancestor_queries
Create the entities with a parent. When one of the entities is deleted, you can run an ancestor query for your table's list view, which is strongly consistent after data changes.
Example ancestor query:
tom = Person(key_name='Tom')
# Only photos created with parent=tom belong to Tom's entity group,
# so only those are returned (with strong consistency) by this query.
photo_query = Photo.all()
photo_query.ancestor(tom)
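For completeness, a sketch of writing an entity into that group (the caption field is illustrative):

tom = Person(key_name='Tom')
tom.put()
# parent=tom places the photo in Tom's entity group, so the ancestor
# query above sees this write with strong consistency.
photo = Photo(parent=tom, caption='Beach')
photo.put()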
With the datastore, unless you can use ancestor queries, you can't guarantee when the indexes will be updated; only the entity itself is guaranteed (for getting by key later) after a synchronous put. The best approach is a combination of your suggestion, where the client takes its own action into account to patch the UI, plus perhaps using memcache to remember recent actions and patch query results server-side before returning them to the client.
Here is a different approach: use JavaScript and AJAX. When the user clicks a link, you do two things:
Use JavaScript/jQuery to remove the row from the DOM, and
Send an AJAX call to the server to do the appropriate datastore modification (a server-side sketch follows below).
It makes for a nice user experience because you are not reloading the page at all.
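A minimal server-side sketch for that AJAX endpoint, assuming webapp2 and NDB as in the question's code (the handler name and the key parameter are illustrative):

import json

import webapp2
from google.appengine.ext import ndb

class DeleteProductHandler(webapp2.RequestHandler):
    def post(self):
        # Deletes by key are strongly consistent; the client has already
        # removed the row from the DOM, so no redirect is needed.
        key = ndb.Key(urlsafe=self.request.get('key'))
        key.delete()
        self.response.headers['Content-Type'] = 'application/json'
        self.response.write(json.dumps({'deleted': True}))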
You might want to consider that there is always room for displaying outdated info in the table: for example, display the table simultaneously in two different windows/tabs, then perform the deletion in one of them; the other will still display a delete link, which will cause a 404 if followed.
With this in mind, I'd first focus on managing expectations (the user should know that the page may occasionally display outdated info) and then on the user's ability to bring an outdated page back in sync (a refresh button?). That might make the issue moot.
The delay-based "solutions" are bound to fail sooner or later in race-condition scenarios, so I wouldn't bother with the extra complexity. Especially on the page where the deletion is done: that's exactly where the user knows (for free) that the info is outdated, and would be inclined to refresh until the recent change becomes visible.

When can Google Appengine datastore return stale data?

Is there a difference in the results I can expect from this code:
query = MyModel.all(keys_only=True).filter('myFlag', True)
keys = list(query)
models = db.get(keys)
versus this code:
query = MyModel.all().filter('myFlag', True)
models = list(query)
i.e., will models be the same in both cases?
If not, why not? I had thought that eventual consistency described how indexes for models take a while to update and can therefore be inconsistent with the most recently written data.
But I recently experienced a case where I was actually getting stale data from a query like the second one: model.myFlag was True for the models retrieved via the query, but False when I actually fetched the model via its key.
So in that case, where is the data for myFlag coming from?
Is it that getting an entity via key ensures replication across the datastore nodes and returns the latest data, whereas getting it via query simply retrieves the data from the nearest datastore node?
Edit:
I read this article, and assuming Cloud Datastore works the same way as the App Engine datastore, the answer to my question is yes: entities returned from queries may have stale values.
https://cloud.google.com/developers/articles/balancing-strong-and-eventual-consistency-with-google-cloud-datastore#h.tf76fya5nqk8
Yes, as you mentioned, queries may return stale values. When executing a query, the datastore chooses performance over consistency.
More in depth: for an entity group, each node has a log of writes that have not been applied yet. When you execute a read by key or an ancestor query, the entity groups involved first have their logs applied. However, when you execute a normal (non-ancestor) query, the results could come from any entity group, so the entity groups are not caught up first.
Be careful with the first code example, though: the indexes used to actually find those entities may not be up to date either, so it is quite possible not to get all entities with myFlag = True. If you are interested, I would recommend reading the Megastore paper.
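A small sketch of the difference, using the question's model (the write sequence is illustrative):

from google.appengine.ext import db

class MyModel(db.Model):
    myFlag = db.BooleanProperty()

entity = MyModel(myFlag=True)
key = entity.put()

entity.myFlag = False
entity.put()  # the index entry for myFlag may lag behind this write

# Strongly consistent: pending writes for the entity group are applied
# before the read, so this always sees myFlag == False.
fresh = db.get(key)

# Eventually consistent: the query may still find the entity through a
# stale index entry, and may even return a stale copy of the entity.
maybe_stale = MyModel.all().filter('myFlag =', True).get()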

AppEngine NDB query returns different results

I have a query in my live app that has gone "odd"...
Running the 1.8.4 SDK... a 1.8.5 live instance... using Python 2.7.
Measurement is an NDB model with a string property called status and a key property called asset.
(Deep in my handler code.... )
cursor = None
limit = 10
asset_key = <a key to an actual asset>
qry = Measurement.query(
    Measurement.status == 'PENDING',
    Measurement.asset == asset_key)
results, cursor, more = qry.fetch_page(page_size=limit, start_cursor=cursor)
Now the weird thing is, if I run this, sometimes I get 4 items and sometimes only 1 (the right answer is 4).
The dump of the query is exactly the same... cursor is set to None... the limit is always the same... same handler... same query, and no new records in between each query. Fresh instance (e.g. first request, no other users).
Each query is separated by only seconds, yet the results are different.
Am I missing something here? Has anyone else experienced this? Is this some sort of corrupt index? (It is a relatively large "table" with 482,911 items.) Is NDB caching a cursor variable?
Very very odd.
Queries do not look up values in any cache. However, query results are written back to the in-context cache if the cache policy says so (as per the docs). https://developers.google.com/appengine/docs/python/ndb/cache#incontext
Perhaps review the caching policy for the entity in question. However, from your snippet the query does not look strongly consistent (it has no ancestor), and that is the more likely cause of this issue: https://developers.google.com/appengine/docs/python/datastore/structuring_for_strong_consistency
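For reference, a hedged sketch of the strongly consistent variant, assuming each Measurement is written with its Asset as parent (this changes how entities are keyed, so it only helps for newly written data, and the asset property becomes redundant):

# Write: parent=asset_key puts the measurement in the asset's entity group.
measurement = Measurement(parent=asset_key, status='PENDING')
measurement.put()

# Read: ancestor queries are strongly consistent, so a fresh instance
# should see the same 4 results every time.
qry = Measurement.query(
    Measurement.status == 'PENDING',
    ancestor=asset_key)
results, cursor, more = qry.fetch_page(10)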

Optimize Django

I have a problem with page load speed.
It currently takes about 7 seconds to load a page, of which 2~3 seconds is Django processing.
The obvious thing to blame is my lack of architectural knowledge: each page executes about 50 queries on average, as shown by the Django Debug Toolbar, but most of the queries are things like "yesterday's snapshot (grouped by something)" or "a daily snapshot (grouped by something) from before yesterday" and don't have to be recomputed on every request.
The ideas I have come up with are in-memory caching, or creating a new table for data that can be prepared in advance.
Is there any convention or design pattern for this kind of issue?
Sample queries are below (I believe they should not have to run on every request for yesterday's or last month's data):
SELECT `sample_salestarget`.`id`, `sample_salestarget`.`country_id`, `sample_salestarget`.`year`, `sample_salestarget`.`month`, `sample_salestarget`.`sales` FROM `sample_salestarget` WHERE (`sample_salestarget`.`country_id` = "abc" AND `sample_salestarget`.`month` = 8 AND `sample_salestarget`.`year` = 2012 )
SELECT `sample_dailysummary`.`id`, `sample_dailysummary`.`country_id`, `sample_dailysummary`.`date`, `sample_dailysummary`.`pv_day`, `sample_dailysummary`.`pv_week`, `sample_dailysummary`.`pv_month`, `sample_dailysummary`.`active_uu_day`, `sample_dailysummary`.`active_uu_week`, `sample_dailysummary`.`active_uu_month`, `sample_dailysummary`.`active_uu_7days`, `sample_dailysummary`.`active_uu_30days`, `sample_dailysummary`.`paid_uu_day`, `sample_dailysummary`.`paid_uu_week`, `sample_dailysummary`.`paid_uu_month`, `sample_dailysummary`.`sales_day`, `sample_dailysummary`.`sales_week`, `sample_dailysummary`.`sales_month`, `sample_dailysummary`.`register_uu_day`, `sample_dailysummary`.`register_uu_week`, `sample_dailysummary`.`register_uu_month`, `sample_dailysummary`.`pay_count_day`, `sample_dailysummary`.`pay_count_week`, `sample_dailysummary`.`pay_count_month`, `sample_dailysummary`.`total_user`, `sample_dailysummary`.`inv_access_uu`, `sample_dailysummary`.`inv_sender_uu`, `sample_dailysummary`.`inv_accepted_uu`, `sample_dailysummary`.`inv_send_count`, `sample_dailysummary`.`memo`, `sample_dailysummary`.`first_charge_uu` FROM `sample_dailysummary` WHERE (`sample_dailysummary`.`date` = '2012-09-07' AND `sample_dailysummary`.`country_id` = "abc")
Using Memcached can really speed things up for you. However, it does come with its own problems: on dynamic pages you have to be extra careful about explicitly invalidating caches whenever required.
Along with Memcached, try johnny-cache, which does a very good job of caching your Django ORM queries.
Also, make use of Django's session variables as far as possible (try the cached_db session engine if you're using Memcached; a sample configuration follows below). You could store objects (like your user profile settings) that stay consistent throughout a session. This way you reduce the number of SQL calls again.
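A minimal settings.py sketch for that combination (the Memcached address is illustrative):

# settings.py
CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.memcached.MemcachedCache',
        'LOCATION': '127.0.0.1:11211',
    }
}
# Sessions are served from the cache and written through to the database.
SESSION_ENGINE = 'django.contrib.sessions.backends.cached_db'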
And if you really need quick page loads, maybe try loading your page first and then running your SQL queries asynchronously with Celery, filling in the results in an AJAXy manner.
If this is a production application exposed to the internet and you can't reduce the number of queries you make, then you should at least reuse the answers. I would suggest using Django's cache framework with the Memcached backend to keep database results in RAM; Memcached can be scaled much further than Django's local-memory cache. For a purely local app, Django's local-memory cache is fine and requires almost no setup.
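As a sketch, caching one of the snapshot queries with Django's low-level cache API might look like this (the DailySummary model name and the timeout are illustrative; data from before today never changes, so a long timeout is safe):

from django.core.cache import cache

def daily_summary(country_id, date):
    cache_key = 'dailysummary:%s:%s' % (country_id, date)
    summary = cache.get(cache_key)
    if summary is None:
        summary = DailySummary.objects.get(country_id=country_id, date=date)
        cache.set(cache_key, summary, 60 * 60 * 24)  # cache for one day
    return summary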
Caching for Django

How to use High Replication Datastore

Okay, I have watched the video and read the articles in the App Engine documentation (including Using the High Replication Datastore). However, I am still completely confused about its practical usage. I understand the benefits (from the video) and they sound great. But what I am lacking is a few practical examples. There are plenty of master/slave examples on the web, but very little that illustrates the High Replication Datastore (with proper documentation). The guestbook code example used in the Using the High Replication Datastore article illustrates the ancestor key by adding functionality that the previous guestbook code example does not have (it seems you can switch between guestbooks). This just adds to the confusion.
I often use djangoforms on GAE, and I was wondering if someone could help me translate all these queries into High Replication Datastore compatible queries (let's forget for a moment the discussion that not all queries necessarily need to be HRD-compatible and focus on the example itself).
UPDATE: by High Replication Datastore compatible queries I mean queries that always return the latest data rather than potentially stale data. Using entity groups seems to be the way to go here, but as mentioned before, I haven't found many practical code examples of how to do this, so that is what I am looking for!
So the queries in this article are:
The main recurring query in this article is:
query = db.GqlQuery("SELECT * FROM Item ORDER BY name")
which we will translate to:
query = Item.all().order('name')  # datastore request
Validating the form happens like this:
data = ItemForm(data=self.request.POST)
if data.is_valid():
    # Save the data, and redirect to the view page
    entity = data.save(commit=False)
    entity.added_by = users.get_current_user()
    entity.put()  # datastore request
and getting the latest entry from the datastore for populating a form happens like this:
id = int(self.request.get('id'))
item = Item.get(db.Key.from_path('Item', id))  # datastore request
data = ItemForm(data=self.request.POST, instance=item)
So what do I/we need to do to make all these datastore requests compatible with the high replication datastore?
One last thing that is also not clear to me: does using ancestor keys have any impact on the model in the datastore? For example, in the guestbook code example they use:
def guestbook_key(guestbook_name=None):
    return db.Key.from_path('Guestbook', guestbook_name or 'default_guestbook')
However, 'Guestbook' does not exist in the model, so how can you use db.Key.from_path on it, and why does this work? Does this change how data is stored in the datastore in a way I need to take into account when retrieving it (e.g. does it add another field I should exclude from display when using djangoforms)?
Like I said before, this is confusing me a lot and your help is greatly appreciated!
I'm not sure why you think you need to change your queries at all. The documentation that you link to clearly states:
The back end changes, but the datastore API does not change at all. You'll use the same programming interfaces no matter which datastore you're using.
The point of that page is just to say that queries may be out of sync if you don't use entity groups. Your final code snippet is an example of exactly that: the key built from the path ('Guestbook', guestbook_name) is an ancestor key. I don't understand why you think the kind needs to exist in the model. Once again, this is unchanged from the non-HR datastore; it has always been the case that keys are built up from paths, which can consist of arbitrary kind and name strings. You probably need to reread the documentation on entity groups and keys.
The changes to use the HRD are not in how queries are made, but in what guarantees are made about what data you get back. The example you give:
query = db.GqlQuery("SELECT * FROM Item ORDER BY name")
will work in the HRD as well. The catch (basically) is that this kind of query (using either this syntax or the Item.all() form) can return objects that are slightly out of date. This is probably not a big deal with the guestbook.
Note that if you're getting an object by key directly, it will never be out of date. It's only for queries that you can see this issue. You can avoid this problem by placing all the entities that need to be consistent in a single entity group, as sketched below. Note that this limits the rate at which you can write to the entity group.
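A hedged sketch of that pattern applied to the Item example (the parent kind 'ItemList' is illustrative, and the parent entity itself never needs to be written):

# All items share one ancestor, so the listing query below is strongly
# consistent, at the cost of limited write throughput for the group.
list_key = db.Key.from_path('ItemList', 'default')

item = Item(parent=list_key, name='Widget')
item.put()

query = Item.all().ancestor(list_key).order('name')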
In answer to your follow-up question, 'Guestbook' is the kind of the ancestor entity; no such entity needs to actually be stored for its key to serve as a parent.
