AppEngine NDB Query return different Results

I have a query in my live app that has gone "odd"...
Running 1.8.4 SDK... 1.8.5 live instance using Python 2.7
Measurement is an NDB model... with a string property called status and a key property called asset....
(Deep in my handler code.... )
cursor = None
limit = 10
asset_key = <a key to an actual asset>
qry = Measurement.query(
    Measurement.status == 'PENDING',
    Measurement.asset == asset_key)
results, cursor, more = qry.fetch_page(page_size=limit, start_cursor=cursor)
Now the weird thing is that when I run this, sometimes I get 4 items and sometimes only 1 (the right answer is 4).
The dump of the query is exactly the same each time: cursor is set to None, the limit is always the same, same handler, same query, and no new records are written between queries. It is a fresh instance (e.g. first request, no other users).
The queries are only seconds apart, yet the results are different.
Am I missing something here? Has anyone else experienced this? Is this some sort of corrupt index? (It is a relatively large "table" with 482,911 items.) Is NDB caching a cursor variable?
Very very odd.

Queries do not look up values in any cache. However, query results are written back to the in-context cache if the cache policy says so (as per the docs). https://developers.google.com/appengine/docs/python/ndb/cache#incontext
Perhaps review the caching policy for the entity in question. However, from your snippet I'm unsure if your query is strongly consistent. That is more likely the cause of this issue: https://developers.google.com/appengine/docs/python/datastore/structuring_for_strong_consistency

Related

Having issues doing fast enough db inserts inside a Flask endpoint

I have an HTTP POST endpoint in Flask which needs to insert whatever data comes in into a database. This endpoint can receive up to hundreds of requests per second. Doing an insert every time a new request comes in takes too much time. I thought that doing a bulk insert every 1000 requests, with the data from the previous 1000 requests, should work as some sort of caching mechanism. I have tried saving 1000 incoming data objects into a collection and then doing a bulk insert once the array is 'full'.
Currently my code looks like this:
@app.route('/user', methods=['POST'])
def add_user():
    firstname = request.json['firstname']
    lastname = request.json['lastname']
    email = request.json['email']
    usr = User(firstname, lastname, email)
    global bulk
    bulk.append(usr)
    if len(bulk) > 1000:
        # Save the buffered objects before resetting the buffer
        db.session.bulk_save_objects(bulk)
        db.session.commit()
        bulk = []
    return user_schema.jsonify(usr)
The problem I'm having with this is that the database becomes 'locked', and I really don't know if this is a good solution but just poorly implemented, or a stupid idea.
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) database is locked

Your error message indicates that you are using an sqlite DB with SQLAlchemy. You may want to try changing the setting of the sqlite "synchronous" flag to turn syncing OFF. This can speed INSERT queries up dramatically, but it does come with the increased risk of data loss. See https://sqlite.org/pragma.html#pragma_synchronous for more details.
With synchronous OFF (0), SQLite continues without syncing as soon as
it has handed data off to the operating system. If the application
running SQLite crashes, the data will be safe, but the database might
become corrupted if the operating system crashes or the computer loses
power before that data has been written to the disk surface. On the
other hand, commits can be orders of magnitude faster with synchronous
OFF
If your application and use case can tolerate the increased risks, then disabling syncing may negate the need for bulk inserts.
See "How to set SQLite PRAGMA statements with SQLAlchemy".
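As a rough illustration of the pragma itself, here is a minimal sketch using only the standard-library sqlite3 module rather than SQLAlchemy. The helper name connect_without_sync, the file name and the user table are made up for this example. The pragma is set per connection, so it has to be issued on every new connection; with SQLAlchemy the same statement is usually issued from a connect-event listener, as described in the linked question.

```python
import os
import sqlite3
import tempfile


def connect_without_sync(path):
    """Open a SQLite database and turn syncing off (hypothetical helper)."""
    conn = sqlite3.connect(path)
    # The setting applies to this connection only and must be issued
    # outside of a transaction.
    conn.execute("PRAGMA synchronous = OFF")
    return conn


# Illustrative file and table names, not taken from the original post.
db_path = os.path.join(tempfile.mkdtemp(), "example.db")
conn = connect_without_sync(db_path)
conn.execute("CREATE TABLE user (firstname TEXT, lastname TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO user VALUES (?, ?, ?)",
    [("Ada", "Lovelace", "ada@example.com")],
)
conn.commit()

# Reading the pragma back returns the numeric level; 0 corresponds to OFF.
level = conn.execute("PRAGMA synchronous").fetchone()[0]
```

The trade-off is the one quoted above: commits no longer wait for the disk, so an operating system crash or power loss can corrupt the database.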

Once I moved the code to AWS and used an Aurora instance as the database, the problems went away, so I suppose it's safe to conclude that the issue was solely related to my sqlite3 instance.
The final solution gave me satisfactory results and I ended up changing only this line:
db.session.bulk_save_objects(bulk)
to this:
db.session.add_all(bulk)
I can now safely make 400 or more calls per second (I haven't tested more) to that specific endpoint, all ending with valid inserts.

Not an expert on this, but it seems like the database has reached its concurrency limits. You can try using Pony for better concurrency and transaction management:
https://docs.ponyorm.org/transactions.html
By default Pony uses the optimistic concurrency control concept for increasing performance. With this concept, Pony doesn’t acquire locks on database rows. Instead it verifies that no other transaction has modified the data it has read or is trying to modify.

When can Google Appengine datastore return stale data?

Is there a difference in the results I can expect from this code:
query = MyModel.all(keys_only=True).filter('myFlag', True)
keys = list(query)
models = db.get(keys)
versus this code:
query = MyModel.all().filter('myFlag', True)
models = list(query)
i.e., will models be the same in both?
If not, why not? I had thought that eventual consistency is used to describe how indices for models take a while to update and can therefore be inconsistent with the most recently written data.
But I recently experienced a case where I was actually getting stale data from a query like the second one, where model.myFlag was True for the models retrieved via query but False when I actually got the model via key.
So in that case, where is the data for myFlag coming from?
Is it that getting an entity via key ensures replication across the datastore nodes and returns the latest data, whereas getting it via query simply retrieves the data from the nearest datastore node?
Edit:
I read this article, and assuming the Cloud Datastore works the same way as the Appengine Datastore, the answer to my question is yes, entities returned from queries may have stale values.
https://cloud.google.com/developers/articles/balancing-strong-and-eventual-consistency-with-google-cloud-datastore#h.tf76fya5nqk8

Yes, as you mentioned queries may return stale values. When doing a query, the datastore chooses performance over consistency.
More in-depth: For an entity group, each node has a log of writes which have not been applied yet. When you execute a read by key or an ancestor query, the entity groups involved first have their logs applied. However, when you execute a normal query, the results could come from any entity group, so the entity groups are not caught up first.
Be careful about using the first code example, though: the indexes used to find those entities may not be up to date either, so it is very possible not to get all entities with myFlag = True. If you are interested, I would recommend reading the Megastore paper.
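The write-log behaviour described above can be sketched with a toy model. This is plain Python and not the Datastore API: the class ToyEntityGroup and its methods are invented for illustration, and the model ignores indexes and multiple replicas. It only shows why a read by key sees the latest write while a global query can still see the old value.

```python
class ToyEntityGroup:
    """Toy model of one entity group on one replica (not a Datastore API)."""

    def __init__(self):
        self.applied = {}   # key -> properties visible to global queries
        self.pending = []   # committed writes not yet applied on this replica

    def put(self, key, props):
        # A commit is recorded in the log first and applied later.
        self.pending.append((key, dict(props)))

    def catch_up(self):
        # Apply every logged write, as happens before a get or ancestor query.
        for key, props in self.pending:
            self.applied[key] = props
        self.pending = []

    def get(self, key):
        # Strongly consistent: roll the log forward, then read.
        self.catch_up()
        return self.applied.get(key)

    def global_query(self, prop, value):
        # Eventually consistent: read whatever has been applied so far.
        return [k for k, p in self.applied.items() if p.get(prop) == value]


group = ToyEntityGroup()
group.put("m1", {"myFlag": True})
group.catch_up()                      # the first write has been applied
group.put("m1", {"myFlag": False})    # the second write is still in the log

stale_keys = group.global_query("myFlag", True)   # still matches the old value
fresh = group.get("m1")                           # log is applied before reading
after_keys = group.global_query("myFlag", True)   # no longer matches
```

In this sketch the global query returns the key while the flag has already been set to False, which mirrors the stale result described in the question.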

High replication delete delay in local server

I have this code using Python with --high_replication --use_sqlite:
def delete(self, id):
    product = Product.get_by_id(long(id))
    if product is None:
        self.session.add_flash('Product could not be found', level='error')
        self.redirect_to('products')
    else:
        product.key.delete()
        self.session.add_flash('Product is deleted')
        self.redirect_to('products')
After the delete I redirect to the 'products' page which is basically a page querying all Products and displaying them.
The problem is that the page still displays the deleted record.
When I refresh the 'products' page then the record is gone.
Are others facing this as well and is there something I can do?
Edit1:
I'm seeing this behaviour only locally, by the way; on production infrastructure this is not the case.
I solved this in the past for the Java SDK using the following JVM arg:
-Ddatastore.default_high_rep_job_policy_unapplied_job_pct=20
Does the Python SDK have something similar to simulate the amount of eventual consistency you want your application to see locally?
See https://developers.google.com/appengine/docs/java/tools/devserver

What you're seeing is the eventual consistency behavior of the HRD datastore, which the devserver simulates.
https://developers.google.com/appengine/docs/python/datastore/queries#Data_Consistency
In an eventually consistent query, the indexes used to gather the results are also accessed with eventual consistency. Consequently, such queries may sometimes return entities that no longer match the original query criteria, while strongly consistent queries are always transactionally consistent.

How to use High Replication Datastore

Okay, I have watched the video and read the articles in the App Engine documentation (including Using the High Replication Datastore). However I am still completely confused on the practical usage of it. I understand the benefits (from the video) and they sound great. But what I am lacking is a few practical examples. There are plenty of master/slave examples on the web, but very little illustrating (with proper documentation) the high replication datastore. The guestbook code example used in the Using the High Replication Datastore article illustrates the ancestor key by adding a new functionality that the previous guestbook code example does not have (seems you can change guestbook). This just adds to the confusion.
I often use djangoforms on GAE and I was wondering if someone can help me translate all these queries into high replication datastore compatible queries (let's forget for a moment the discussion that not all queries necessarily need to be high replication datastore compatible queries and focus on the example itself).
UPDATE: with high replication datastore compatible queries I refer to queries that always return the latest data and not potential stale data. Using entity groups seems to be the way to go here but as mentioned before, I don't have many practical code examples of how to do this, so that is what I am looking for!
So the queries in this article are:
The main recurring query in this article is:
query = db.GqlQuery("SELECT * FROM Item ORDER BY name")
which we will translate to:
query = Item.all().order('name')  # datastore request
validating the form happens like:
data = ItemForm(data=self.request.POST)
if data.is_valid():
    # Save the data, and redirect to the view page
    entity = data.save(commit=False)
    entity.added_by = users.get_current_user()
    entity.put()  # datastore request
and getting the latest entry from the datastore for populating a form happens like:
id = int(self.request.get('id'))
item = Item.get(db.Key.from_path('Item', id))  # datastore request
data = ItemForm(data=self.request.POST, instance=item)
So what do I/we need to do to make all these datastore requests compatible with the high replication datastore?
One last thing that is also not clear to me: does using ancestor keys have any impact on the model in the datastore? For example, in the guestbook code example they use:
def guestbook_key(guestbook_name=None):
    return db.Key.from_path('Guestbook', guestbook_name or 'default_guestbook')
However 'Guestbook' does not exist in the model, so how can you use 'db.Key.from_path' on this and why would this work? Does this change how data is stored in the datastore which I need to keep into account when retrieving the data (e.g. does it add another field I should exclude from showing when using djangoforms)?
Like I said before, this is confusing me a lot and your help is greatly appreciated!

I'm not sure why you think you need to change your queries at all. The documentation that you link to clearly states:
The back end changes, but the datastore API does not change at all. You'll use the same programming interfaces no matter which datastore you're using.
The point of that page is just to say that queries may be out of sync if you don't use entity groups. Your final code snippet is just an example of that - the string 'Guestbook' is exactly an ancestor key. I don't understand why you think it needs to exist in the model. Once again, this is unchanged from the non-HR datastore - it has always been the case that keys are built up from paths, which can consist of arbitrary strings. You probably need to reread the documentation on entity groups and keys.

The changes to use the HRD are not in how queries are made, but in what guarantees are made about what data you get back. The example you give:
query = db.GqlQuery("SELECT * FROM Item ORDER BY name")
will work in the HRD as well. The catch (basically) is that this kind of query (using either this syntax, or the Item.all() form) can return objects slightly out-of-date. This is probably not a big deal with the guestbook.
Note that if you're getting an object by key directly, it will never be out-of-date. It's only for queries that you can see this issue. You can avoid this problem with queries by placing all the entities that need to be consistent in a single entity group. Note that this limits the rate at which you can write to the entity group.
In answer to your follow-up question, "Guestbook" is the kind used in the ancestor key path. The key only names a parent, so no Guestbook model or entity has to exist.

Why does not postgresql start returning rows immediately?

The following query returns data right away:
SELECT time, value from data order by time limit 100;
Without the limit clause, it takes a long time before the server starts returning rows:
SELECT time, value from data order by time;
I observe this both by using the query tool (psql) and when querying using an API.
Questions/issues:
- The amount of work the server has to do before starting to return rows should be the same for both select statements. Correct?
- If so, why is there a delay in case 2?
- Is there some fundamental RDBMS issue that I do not understand?
- Is there a way I can make postgresql start returning result rows to the client without pause, also for case 2?
- EDIT (see below). It looks like setFetchSize is the key to solving this. In my case I execute the query from python, using SQLAlchemy. How can I set that option for a single query (executed by session.execute)? I use the psycopg2 driver.
The column time is the primary key, BTW.
EDIT:
I believe this excerpt from the JDBC driver documentation describes the problem and hints at a solution (I still need help - see the last bullet list item above):
By default the driver collects all the results for the query at once. This can be inconvenient for large data sets so the JDBC driver provides a means of basing a ResultSet on a database cursor and only fetching a small number of rows.
and
Changing code to cursor mode is as simple as setting the fetch size of the Statement to the appropriate size. Setting the fetch size back to 0 will cause all rows to be cached (the default behaviour).
// make sure autocommit is off
conn.setAutoCommit(false);
Statement st = conn.createStatement();
// Turn use of the cursor on.
st.setFetchSize(50);

The psycopg2 DB-API driver buffers the whole query result before returning any rows. You'll need to use a server-side cursor to fetch results incrementally. For SQLAlchemy, see server_side_cursors in the docs and, if you're using the ORM, the Query.yield_per() method.
SQLAlchemy currently doesn't have an option to set that per single query, but there is a ticket with a patch for implementing that.
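To show the shape of incremental fetching at the DB-API level, here is a small sketch. The helper iter_rows is hypothetical, and sqlite3 stands in for PostgreSQL only so that the example runs without a server. With psycopg2 the cursor would have to be a named cursor, created with conn.cursor(name=...), for the server to hold the result set; on a default client-side cursor, fetchmany only slices a result that has already been buffered in full.

```python
import sqlite3


def iter_rows(cursor, batch_size=50):
    """Yield rows from a DB-API cursor in batches instead of using fetchall()."""
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            return
        for row in rows:
            yield row


# sqlite3 is a stand-in here; the table mirrors the one in the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (time INTEGER PRIMARY KEY, value REAL)")
conn.executemany(
    "INSERT INTO data VALUES (?, ?)",
    [(t, t * 0.5) for t in range(200)],
)

cur = conn.cursor()
cur.execute("SELECT time, value FROM data ORDER BY time")
rows = iter_rows(cur, batch_size=50)
first = next(rows)                  # available after the first batch only
remaining = sum(1 for _ in rows)    # the rest arrive batch by batch
```

The batch size plays the same role as setFetchSize in the JDBC excerpt quoted in the question: it bounds how many rows are pulled per round trip.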

In theory, because your ORDER BY is by primary key, a sort of the results should not be necessary, and the DB could indeed return data right away in key order.
I would expect a capable DB to notice this and optimize for it. It seems that PostgreSQL does not. *shrug*
You don't notice any impact if you have LIMIT 100 because it's very quick to pull those 100 results out of the DB, and you won't notice any delay if they're first gathered up and sorted before being shipped out to your client.
I suggest trying to drop the ORDER BY. Chances are, your results will be correctly ordered by time anyway (there may even be a standard or specification that mandates this, given your PK), and you might get your results more quickly.
