Strongly consistent queries for root entities in GAE? - python

I'd like some advice on the best way to do a strongly consistent read/write in Google App Engine.
My data is stored in a class like this.
class UserGroupData(ndb.Model):
    users_in_group = ndb.StringProperty(repeated=True)
    data = ndb.StringProperty(repeated=True)
I want to write a safe update method for this data. As far as I understand, I need to avoid eventually consistent reads here, because they risk data loss. For example, the following code is unsafe because it uses a vanilla query which is eventually consistent:
def update_data(user_id, additional_data):
    entity = UserGroupData.query(UserGroupData.users_in_group == user_id).get()
    entity.data.append(additional_data)
    entity.put()
If the entity returned by the query is stale, data is lost.
In order to achieve strong consistency, it seems I have a couple of different options. I'd like to know which option is best:
Option 1:
Use get_by_id(), which is always strongly consistent. However, there doesn't seem to be a neat way to do this here: there is no clean way to derive the key for a UserGroupData entity directly from a user_id, because the relationship is many-to-one. It also seems brittle and risky to require my external clients to store and send the key for UserGroupData.
Option 2:
Place my entities in a single entity group and perform an ancestor query. Something like:
def update_data(user_id, additional_data):
    entity = UserGroupData.query(
        UserGroupData.users_in_group == user_id,
        ancestor=ancestor_for_all_ugd_entities()).get()
    entity.data.append(additional_data)
    entity.put()
I think this should work, but putting all UserGroupData entities into a single entity group seems extreme: it limits writes to roughly 1 per second across all of them. That seems like the wrong approach, since each UserGroupData is actually logically independent.
Really what I'd like to do is perform a strongly consistent query for a root entity. Is there some way to do this? I noticed a suggestion in another answer to essentially shard the ancestor group. Is this the best that can be done?
Option 3:
A third option is to do a keys_only query followed by a get() on the returned key, like so:
def update_data(user_id, additional_data):
    entity_key = UserGroupData.query(
        UserGroupData.users_in_group == user_id).get(keys_only=True)
    entity = entity_key.get()
    entity.data.append(additional_data)
    entity.put()
As far as I can see this method is safe from data loss, since my keys are not changing and the get() gives strongly consistent results. However, I haven't seen this approach mentioned anywhere. Is this a reasonable thing to do? Does it have any downsides I need to understand?

I think you are conflating the issue of inconsistent queries with safe updates of the data.
A query like the one in your example, UserGroupData.query(UserGroupData.users_in_group == user_id).get(), will only ever return one entity, and only if the user_id is in the group.
If the user has only just been added and the index is not yet up to date, you won't get a record back and therefore you won't update it.
Any update, irrespective of the method of fetching the entity, should be performed inside a transaction to ensure update consistency.
As for ancestors improving the consistency of the query: it's not clear whether you plan to have multiple UserGroupData entities, and if you do, why are you doing a get()?
So option 3 is probably your best bet: do the keys-only query, then inside a transaction do the Key.get() and the update. Remember that cross-group transactions are limited to 5 entity groups.
With this approach, if the index the query is based on is out of date, one of three things can happen:
the record you want isn't found, because the newly added user_id is not yet reflected in the index;
the record you want is found, and the get() fetches it consistently;
the record you want is found, but the user_id has actually been removed and the index is out of date; the get() retrieves the entity consistently and the user_id is not present.
Your code can then decide what course of action to take, for example as in the sketch below.
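A minimal sketch of that combination, assuming the model from the question (the transactional helper and the return values are my own additions, not tested code):
from google.appengine.ext import ndb

def update_data(user_id, additional_data):
    # Eventually consistent, keys-only query: cheap, and safe because the
    # key itself cannot go stale.
    entity_key = UserGroupData.query(
        UserGroupData.users_in_group == user_id).get(keys_only=True)
    if entity_key is None:
        return False  # case 1: index not yet updated, nothing to update

    @ndb.transactional
    def txn():
        # Strongly consistent read by key, then update and write inside the
        # same transaction so concurrent updates cannot clobber each other.
        entity = entity_key.get()
        if user_id not in entity.users_in_group:
            return False  # case 3: index was stale, user already removed
        entity.data.append(additional_data)
        entity.put()
        return True  # case 2: found and updated consistently

    return txn()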
What is the use case for querying all the UserGroupData entities that a particular user is a member of that would require updates?

Related

Is it possible that transaction.atomic does not work as expected?

This is a DRF API view for liking an entry. When someone likes an entry, I insert a like record into the entry_like table and increment the likes_num field in the entry table by 1. But something went wrong: for some entries, the count of entry_like records is less than the likes_num field in the entry table. I do not know why it does not work as expected, even though the post method has the transaction.atomic decorator on it. Are there cases where transaction.atomic does not run as expected?
Yes, I think it is the case that transaction.atomic() does not work the way you expect.
To understand what it does, you have to understand SQL's transaction isolation levels and exactly what behavior they guarantee. You don't mention what database you're using, but PostgreSQL has good documentation on the subject.
Your expectation seems to be that it will work as if the isolation level was SERIALIZABLE. In fact, the default isolation level in Django is READ COMMITTED. And in that isolation level, if you have two of these transactions operating at once, they will both overwrite likes_num with the same number.
One solution is to use an F-object instead of setting likes_num to a specific value. In that case, the new value will be based on whatever value is in the field at the time of the write, rather than what value was in the field at the earlier point when you read the row.
entry.likes_num = F('likes_num') + 1
The other solution is to use select_for_update(), which will lock the entry row. It's better to avoid locks if you can, so I would opt for the F-object version.
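For completeness, a hedged sketch of the select_for_update() variant; the model and field names (Entry, EntryLike, likes_num) are assumptions based on the question:
from django.db import transaction

@transaction.atomic
def like_entry(entry_id, user):
    # Locks the entry row until the transaction commits, so two concurrent
    # likes cannot both read the old likes_num and write back the same value.
    entry = Entry.objects.select_for_update().get(pk=entry_id)
    EntryLike.objects.create(entry=entry, user=user)
    entry.likes_num += 1
    entry.save()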
I think you need to use F objects
from django.db.models import F
...
entry.likes_num = F('likes_num') + 1
entry.save()
This is needed because neither transaction raises an error and both are valid, so transaction.atomic on its own does not prevent the lost update; the F object pushes the increment down to the database.

Google app engine: better way to make a query

Say I have RootEntity, AEntity(child of RootEntity), BEntity(child of AEntity).
class RootEntity(ndb.Model):
    rtp = ndb.StringProperty()

class AEntity(ndb.Model):
    ap = ndb.IntegerProperty()

class BEntity(ndb.Model):
    bp = ndb.StringProperty()
So in different handlers I need to get instances of BEntity with a specific ancestor (an instance of AEntity).
Here is my query:
BEntity.query(ancestor=ndb.Key(
    "RootEntity", 1,
    "AEntity", AEntity.query(ancestor=ndb.Key("RootEntity", 1))
                      .filter(AEntity.ap == int(some_value))
                      .get().key.integer_id()))
How can I optimize this query and make it better, maybe less sophisticated?
Upd:
This query is part of a function with the @ndb.transactional decorator.
You should not use Entity Groups to represent entity relationships.
Entity groups have a special purpose: to define the scope of transactions. They give you the ability to update multiple entities transactionally, as long as they are part of the same entity group (this limitation has been somewhat relaxed by the newer XG transactions). They also allow you to use queries within transactions (not available with XG transactions).
The downside of entity groups is that they have an update limitation of 1 write/second.
In your case my suggestion would be to use separate entities and make references between them. The reference should be a Key of the referenced entity as this is type-safe.
Regarding query simplicity: GAE unfortunately does not support JOINs or reference (multi-entity) queries, so you would still need to combine multiple queries together (as you do now).
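As an illustration only (a sketch with property names I made up), the reference-based version might look like this, with the two queries combined by hand:
from google.appengine.ext import ndb

class RootEntity(ndb.Model):
    rtp = ndb.StringProperty()

class AEntity(ndb.Model):
    root = ndb.KeyProperty(kind=RootEntity)
    ap = ndb.IntegerProperty()

class BEntity(ndb.Model):
    parent_a = ndb.KeyProperty(kind=AEntity)
    bp = ndb.StringProperty()

# No JOINs in the Datastore, so fetch the AEntity first, then its BEntities:
a = AEntity.query(AEntity.root == root_key,
                  AEntity.ap == int(some_value)).get()
bs = BEntity.query(BEntity.parent_a == a.key).fetch()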
There is give and take with ancestor queries. They are more verbose and messier to deal with, but you get better structure for your data and consistency in your queries.
To simplify this, if your handler knows the BEntity you want to get, just pass around the key.urlsafe() encoded key; it already has all of your ancestor information encoded.
If this is not possible, try possibly restructuring your data. Since these objects are all of the same ancestor, they belong to the same entity group, thus at most you can insert/update ~1 time per second for objects in that entity group. If you require higher throughput or do not require consistent ancestral queries, then try using ndb.KeyProperty to link entities with a reference to a parent rather than as an ancestor. Then you'd only need to get a single parent to query on rather than the parent and the parent's parent.
You should also try and use IDs whenever possible, so you can avoid having to filter for entities in your datastore by properties and just reference them by ID:
BEntity.query(ancestor = ndb.Key("RootEntity", 1, "AEntity", int(some_value)))
Here, int(some_value) is the integer ID of the AEntity you used when you created that object. Just be sure that you can ensure the IDs you manually create/use will be unique across all instances of that Model that share the same parent.
EDIT:
To clarify, my last example should have been clearer: I was suggesting to restructure the data so that int(some_value) is used as the integer ID of the AEntity, rather than storing it as a separate property of the entity, if possible of course. In the example given, a query is performed for the AEntity objects that have a given integer field value of int(some_value), and it is executed with a get(), implying that you always expect a single entity back for that value. That makes it a good candidate to use as the integer ID of the object's key, eliminating the need for a query.
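A sketch of what that restructuring could look like, assuming some_value is known and unique per parent at creation time:
# Create the AEntity with some_value as its integer ID instead of storing it
# in the ap property.
a = AEntity(parent=ndb.Key("RootEntity", 1), id=int(some_value))
a.put()

# Later, the AEntity's key can be built directly, so no query is needed:
bs = BEntity.query(
    ancestor=ndb.Key("RootEntity", 1, "AEntity", int(some_value))).fetch()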

How do I explicitly define the query used in subqueryload_all?

I'm using subqueryload/subqueryload_all pretty heavily, and I've run into the edge case where I tend to need to very explicitly define the query that is used during the subqueryload. For example I have a situation where I have posts and comments. My query looks something like this:
posts_q = db.query(Post).options(subqueryload(Post.comments))
As you can see, I'm loading each Post's comments. The problem is that I don't want all of the posts' comments: I also need to take into account a deleted field, and they need to be ordered by create time descending. The only way I have observed this being done is by adding options to the relationship() declaration between posts and comments. I would prefer not to do this, because it means that the relationship cannot be reused everywhere after that, and I have other places in the app where those constraints may not apply.
What I would love to do, is explicitly define the query that subqueryload/subqueryload_all uses to load the posts' comments. I read about DisjointedEagerLoading here, and it looks like I could simply define a special function that takes in the base query, and a query to load the specified relationship. Is this a good route to take for this situation? Anyone ever run into this edge case before?
The answer is that you can define multiple relationships between Posts and Comments:
class Post(...):
    active_comments = relationship(Comment,
        primaryjoin=and_(Comment.post_id == Post.post_id,
                         Comment.deleted == False),
        order_by=Comment.created.desc())
Then you should be able to subqueryload by that relationship:
posts_q = db.query(Post).options(subqueryload(Post.active_comments))
You can still use the existing .comments relationship elsewhere.
I also had this problem and it took me some time to realize that this is an issue by design. When you say Post.comments, you refer to the relationship that says "these are all the comments of that post". However, now you want to filter them. If you specify that condition somewhere on the subqueryload, then you are essentially loading only a subset of values into Post.comments. Thus, there will be values missing. Essentially you have a faulty representation of your data in the model.
The question then is how to approach this, because you obviously need these values somewhere. The way I go is building the subquery myself and specifying the special conditions there. That means you get two objects back: the list of posts and the list of comments. It is not a pretty solution, but at least it does not display the data in a wrong way. If you were to access Post.comments for some reason, you can safely assume it contains all comments.
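A rough sketch of that two-query approach, assuming the column names from the question (post_id, deleted, created):
from collections import defaultdict

posts = db.query(Post).all()
comments = (db.query(Comment)
              .filter(Comment.post_id.in_([p.post_id for p in posts]),
                      Comment.deleted == False)
              .order_by(Comment.created.desc())
              .all())

# Group the filtered comments by post yourself instead of relying on
# Post.comments, which stays untouched and therefore complete.
active_comments = defaultdict(list)
for c in comments:
    active_comments[c.post_id].append(c)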
But there is room for improvement: you might want to have this attached to your class so you don't carry around two variables. The easy way might be to define a second relationship, e.g. published_comments, which specifies extra parameters. You could then also control that no one writes to it, e.g. with attribute events. In those events you could, instead of forbidding manipulation, define how manipulation is allowed. The only problem might be when updates happen: e.g. when you add a comment to Post.comments, published_comments won't be updated automatically because they are not aware of each other. Again, I'd use events for this if it is a required feature (but with the ugly solution above you would not have that either).
As a last, hybrid, solution you could take the first approach and then just assign those values to your object, e.g. Post.deleted_comments = deleted_comments.
The thing to keep in mind here is that it is generally not a clever idea to manipulate the query the ORM makes as this could lead to problems later on. I have taken this approach and manipulated the queries (with contains_eager this is easily possible) but it has created problems on some points (while generally being functional) so I dropped that approach.

GAE/P: Dealing with eventual consistency

In my app, I have the following process:
Get a very long list of people
Create an entity for each person
Send an email to each person (step 2 must be completed before step 3 starts)
Because the list of people is very large, I don't want to put them in the same entity group.
In doing step 3, I can query the list of people like this:
Person.all()
Because of eventual consistency, I might miss some people in step 3. What is a good way to ensure that I am not missing anyone in step 3?
Is there a better solution than this?:
while Person.all().count() < N:
    pass

for p in Person.all():
    # do whatever
EDIT:
Another possible solution came to mind. I could create a linked list of the people. I can store a link to the first one, he can link to the second one and so one. It seems that the performance would be poor however, because you'd be doing each get separately and wouldn't have the efficiencies of a query.
UPDATE: I reread your post and saw that you don't want to put them all in the same entity group. I'm not sure how to guarantee strong consistency without doing so. You might want to restructure your data so that you don't have to put them in the same entity group, but in several. Perhaps depending on some aspect of a group of Person entities? (e.g., mailing list they are on, type of email being sent, etc.) Does each Person only contain a name and an email address, or are there other properties involved?
Google suggests a few other alternatives:
If your application is likely to encounter heavier write usage, you may need to consider using other means: for example, you might put recent posts in a memcache with an expiration and display a mix of recent posts from the memcache and the Datastore, or you might cache them in a cookie, put some state in the URL, or something else entirely. The goal is to find a caching solution that provides the data for the current user for the period of time in which the user is posting to your application. Remember, if you do a get, a put, or any operation within a transaction, you will always see the most recently written data.
So it looks like you may want to investigate those possibilities, although I'm not sure how well they would translate to what your app needs.
ORIGINAL POST: Use ancestor queries.
From Google's "Structuring Data for Strong Consistency":
To obtain strongly consistent query results, you need to use an ancestor query limiting the results to a single entity group. This works because entity groups are a unit of consistency as well as transactionality. All data operations are applied to the entire group; an ancestor query won't return its results until the entire entity group is up to date. If your application relies on strongly consistent results for certain queries, you may need to take this into consideration when designing your data model. This page discusses best practices for structuring your data to support strong consistency.
So when you create a Person entity, set a parent for it. I believe you could even just have a specific entity be the "parent" of all the others, and it should give you strong consistency. (Although I like to structure my data a bit with ancestors anyway.)
# Gives you the ancestor key
def ancestor_key(kind, id_or_name):
    return db.Key.from_path(kind, id_or_name)

# kind is the db model you're using (should be 'Person' in this case) and
# id_or_name should be the key id or name for the parent
new_person = Person(your_params, parent=ancestor_key('Kind', id_or_name))
You could even do queries at that point for all the entities with the same parent, which is nice. But that should help you get more consistent results regardless.
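For example, a strongly consistent ancestor query scoped to that shared parent might look like this (same placeholder kind and id as above):
# Strongly consistent because it is restricted to one entity group:
people = Person.all().ancestor(ancestor_key('Kind', id_or_name)).fetch(1000)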

Google app engine item / transaction data model

I am working on a GAE Python project which has items and transactions on items.
At first, we tried to use an item kind and a transaction kind with a reference property, but it complicated the queries a lot.
So we switched to an all-in-one version with the transaction data stored directly in the item, which resulted in a lot of attributes not being used, since not all items are involved in transactions.
I expected it to speed up the app, but is this the best way to do this?
Knowing that:
we expect to have lots of transactions and many more items.
we need to check the transaction status (actually stored in the item's status).
there is only one transaction possible per item, but there are many different kinds of transactions.
Is there a better solution?
EDIT:
The problem is that I mostly query the items using the transaction attributes, but with at most 2 where clauses at once; I also update the transaction frequently.
Actually I have something like this:
class MyItem(db.Model):
    owner = db.ReferenceProperty(MyUser)
    descr = db.StringProperty()
    status = db.IntegerProperty()  # contains item status / transaction status
    tx_actor = db.EmailProperty()
    tx_token = db.StringProperty()
    latest_tx_date = db.DateTimeProperty()
Since it's a 1-to-1 mapping, it boils down to how many attributes of the item or the transaction you need to query by (those attributes need to be indexed) and how often you write your items or transactions.
An example where it's fine to merge them:
- You rarely write your merged item/transaction object.
- You query on a small set of attributes, many of the attributes do not need to be indexed.
- When you do queries, you usually want both the item AND the transaction.
An example where it's a bad idea to merge them:
- Your item has many indexed attributes but your transaction has very few, and you need to update your transaction frequently. In this case you'd be better off keeping them separate, because every time you write the transaction you incur all the write costs of updating the item's indices.
Another option, if you DON'T need to query on the transaction, is to store the transaction as a JSON-encoded string; then you don't need to define all the attributes up front. You could also use the Expando class.
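A sketch of that JSON variant, reusing the model from the question (the tx_data field and its contents are illustrative, not a definitive layout):
import json

class MyItem(db.Model):
    owner = db.ReferenceProperty(MyUser)
    descr = db.StringProperty()
    status = db.IntegerProperty()  # still indexed and queryable
    tx_data = db.TextProperty()    # unindexed JSON blob, cheap to rewrite

item.tx_data = json.dumps({'actor': actor_email,
                           'token': tx_token,
                           'latest_date': latest_tx_date.isoformat()})
item.put()

tx = json.loads(item.tx_data) if item.tx_data else {}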
To get better answers, you'd be better off posting examples of what your items/transactions look like, and the types of queries you'd want to run.
