Google app engine: better way to make query


Say I have RootEntity, AEntity (child of RootEntity), and BEntity (child of AEntity):
from google.appengine.ext import ndb
class RootEntity(ndb.Model):
    rtp = ndb.StringProperty()
class AEntity(ndb.Model):
    ap = ndb.IntegerProperty()
class BEntity(ndb.Model):
    bp = ndb.StringProperty()
So in different handlers I need to get instances of BEntity with a specific ancestor (an instance of AEntity).
Here is my query: BEntity.query(ancestor = ndb.Key("RootEntity", 1, "AEntity", AEntity.query(ancestor = ndb.Key("RootEntity", 1)).filter(AEntity.ap == int(some_value)).get().key.integer_id()))
How can I optimize this query? Can it be made better, or perhaps less complicated?
Upd:
This query is part of a function with the @ndb.transactional decorator.

You should not use Entity Groups to represent entity relationships.
Entity groups have a special purpose: to define the scope of transactions. They give you the ability to update multiple entities transactionally, as long as they are part of the same entity group (this limitation has been somewhat relaxed with the new XG transactions). They also allow you to use queries within transactions (not available via XG transactions).
The downside of entity groups is that they have an update limitation of 1 write/second.
In your case my suggestion would be to use separate entities and make references between them. The reference should be a Key of the referenced entity as this is type-safe.
Regarding query simplicity: GAE unfortunately does not support JOINs or reference (multi-entity) queries, so you would still need to combine multiple queries together (as you do now).

There is a give and take with ancestor queries. They are more verbose and messier to deal with, but you get a better structure for your data and consistency in your queries.
To simplify this, if your handler knows the BEntity you want to get, just pass around the key.urlsafe() encoded key; it already has all of your ancestor information encoded.
If this is not possible, consider restructuring your data. Since these objects all share the same ancestor, they belong to the same entity group, so you can insert/update at most ~1 time per second for objects in that entity group. If you require higher throughput or do not require consistent ancestor queries, then try using ndb.KeyProperty to link entities with a reference to a parent rather than as an ancestor. Then you'd only need to get a single parent to query on rather than the parent and the parent's parent.
You should also try and use IDs whenever possible, so you can avoid having to filter for entities in your datastore by properties and just reference them by ID:
BEntity.query(ancestor = ndb.Key("RootEntity", 1, "AEntity", int(some_value)))
Here, int(some_value) is the integer ID of the AEntity you used when you created that object. Just make sure that the IDs you manually create/use are unique across all instances of that Model that share the same parent.
EDIT:
To clarify, my last example was suggesting that you restructure the data so that int(some_value) is used as the integer ID of the AEntity rather than stored as a separate property of the entity, if possible of course. In the example given, a query is performed for the AEntity objects that have a given integer field value of int(some_value) and executed with a get(). This implies that you always expect a single result for that integer value, which makes it a good candidate for the integer ID in the key of that object, eliminating the need for a query.

Related

How to optimize lazy loading of related object, if we already have its instance?

I like how Django ORM lazy loads related objects in the queryset, but I guess it's quite unpredictable as it is.
The queryset API doesn't keep the related objects when they are used to make a queryset, thereby fetching them again when accessed later.
Suppose I have a ModelA instance (say instance_a) which is a foreign key (say for_a) of some N instances of ModelB. Now I want to perform query on ModelB which has the given ModelA instance as the foreign key.
Django ORM provides two ways:
Using .filter() on ModelB:
b_qs = ModelB.objects.filter(for_a=instance_a)
for instance_b in b_qs:
    instance_b.for_a  # <-- fetches the same row for ModelA again
Results in 1 + N queries here.
Using reverse relations on ModelA instance:
b_qs = instance_a.for_a_set.all()
for instance_b in b_qs:
    instance_b.for_a  # <-- this uses the instance_a from memory
Results in 1 query only here.
While the second way can be used to achieve the result, it's not part of the standard API and not useable for every scenario. For example, if I have instances of 2 foreign keys of ModelB (say, ModelA and ModelC) and I want to get related objects to both of them.
Something like the following works:
ModelB.objects.filter(for_a=instance_a, for_c=instance_c)
I guess it's possible to use .intersection() for this scenario, but I would like a way to achieve this via the standard API. After all, covering such cases would require more code with non-standard queryset functions which may not make sense to the next developer.
So, the first question: is it possible to optimize such scenarios with the standard API itself?
The second question: if it's not possible right now, can it be added with some tweaks to the QuerySet?
PS: It's my first time asking a question here, so forgive me if I made any mistake.

You could improve the query by using select_related():
b_qs = ModelB.objects.select_related('for_a').filter(for_a=instance_a)
or
b_qs = instance_a.for_a_set.select_related('for_a')
Does that help?

You use .select_related(..) [Django-doc] for ForeignKeys, or .prefetch_related(..) [Django-doc] for something-to-many relations.
With .select_related(..) you will make a JOIN at the database side (a LEFT OUTER JOIN when the ForeignKey is nullable), fetch the columns for both objects in one query, and deserialize them into the proper objects.
ModelB.objects.select_related('for_a').filter(for_a=instance_a)
For relations that are one-to-many (so a reversed ForeignKey), or ManyToManyFields, this is not a good idea, since it could retrieve a large number of duplicate objects. This would result in a large response from the database, and a lot of work at the Python end to deserialize these objects. .prefetch_related will make individual queries, and then do the linking itself.
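To make the difference concrete, here is a minimal sketch that uses Python's sqlite3 module rather than Django itself. The tables model_a and model_b, their columns, and the helpers join_style and prefetch_style are hypothetical stand-ins; they only imitate, roughly, what the two strategies do at the SQL level and are not how Django implements them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE model_a (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE model_b (id INTEGER PRIMARY KEY, label TEXT,
                          for_a_id INTEGER REFERENCES model_a(id));
    INSERT INTO model_a VALUES (1, 'a1'), (2, 'a2');
    INSERT INTO model_b VALUES (10, 'b10', 1), (11, 'b11', 1), (12, 'b12', 2);
""")


def join_style(a_id):
    """Roughly the select_related idea: one query with a JOIN."""
    rows = conn.execute(
        "SELECT b.id, b.label, a.id, a.name "
        "FROM model_b AS b JOIN model_a AS a ON b.for_a_id = a.id "
        "WHERE b.for_a_id = ? ORDER BY b.id",
        (a_id,),
    ).fetchall()
    # Every row carries its own copy of the related model_a columns.
    return [{"id": r[0], "label": r[1], "for_a": {"id": r[2], "name": r[3]}}
            for r in rows]


def prefetch_style(a_ids):
    """Roughly the prefetch_related idea: one query per table, linked in Python."""
    marks = ",".join("?" for _ in a_ids)
    parents = {
        row[0]: {"id": row[0], "name": row[1], "b_set": []}
        for row in conn.execute(
            "SELECT id, name FROM model_a WHERE id IN (%s) ORDER BY id" % marks,
            a_ids)
    }
    # Second query: fetch all children at once, then attach them to parents.
    for b_id, label, for_a_id in conn.execute(
            "SELECT id, label, for_a_id FROM model_b "
            "WHERE for_a_id IN (%s) ORDER BY id" % marks, a_ids):
        parents[for_a_id]["b_set"].append({"id": b_id, "label": label})
    return list(parents.values())
```

Calling join_style(1) returns both model_b rows, each with a repeated copy of the same parent, while prefetch_style([1, 2]) issues two queries in total and stores each parent once.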

Strongly consistent queries for root entities in GAE?

I'd like some advice on the best way to do a strongly consistent read/write in Google App Engine.
My data is stored in a class like this.
from google.appengine.ext import ndb
class UserGroupData(ndb.Model):
    users_in_group = ndb.StringProperty(repeated=True)
    data = ndb.StringProperty(repeated=True)
I want to write a safe update method for this data. As far as I understand, I need to avoid eventually consistent reads here, because they risk data loss. For example, the following code is unsafe because it uses a vanilla query which is eventually consistent:
def update_data(user_id, additional_data):
    entity = UserGroupData.query(UserGroupData.users_in_group == user_id).get()
    entity.data.append(additional_data)
    entity.put()
If the entity returned by the query is stale, data is lost.
In order to achieve strong consistency, it seems I have a couple of different options. I'd like to know which option is best:
Option 1:
Use get_by_id(), which is always strongly consistent. However, there doesn't seem to be a neat way to do this here. There isn't a clean way to derive the key for UserGroupData directly from a user_id, because the relationship is many-to-one. It also seems kind of brittle and risky to require my external clients to store and send the key for UserGroupData.
Option 2:
Place my entities in an ancestor group, and perform an ancestor query. Something like:
def update_data(user_id, additional_data):
    entity = UserGroupData.query(UserGroupData.users_in_group == user_id,
                                 ancestor=ancestor_for_all_ugd_entities()).get()
    entity.data.append(additional_data)
    entity.put()
I think this should work, but putting all UserGroupData entities into a single ancestor group seems like an extreme thing to do. It results in writes being limited to ~1/sec. This seems like the wrong approach, since each UserGroupData is actually logically independent.
Really what I'd like to do is perform a strongly consistent query for a root entity. Is there some way to do this? I noticed a suggestion in another answer to essentially shard the ancestor group. Is this the best that can be done?
Option 3:
A third option is to do a keys_only query followed by get_by_id(), like so:
def update_data(user_id, additional_data):
    entity_key = UserGroupData.query(
        UserGroupData.users_in_group == user_id).get(keys_only=True)
    entity = entity_key.get()
    entity.data.append(additional_data)
    entity.put()
As far as I can see this method is safe from data loss, since my keys are not changing and the get() gives strongly consistent results. However, I haven't seen this approach mentioned anywhere. Is this a reasonable thing to do? Does it have any downsides I need to understand?

I think you are conflating the issue of inconsistent queries with safe updates of the data.
A query like the one in your example, UserGroupData.query(UserGroupData.users_in_group==user_id).get(), will only ever return one entity, provided the user_id is in the group.
If it has only just been added and the index is not up to date, then you won't get a record and therefore you won't update the record.
Any update, irrespective of the method of fetching the entity, should be performed inside a transaction to ensure update consistency.
As to ancestors improving the consistency of the query, it's not obvious whether you plan to have multiple UserGroupData entities. If you do, why are you doing a get()?
So option 3 is probably your best bet: do the keys-only query, then inside a transaction do the Key.get() and update. Remember that cross-group transactions are limited to 5 entity groups.
Given this approach, if the index the query is based on is out of date, then one of three things can happen:
- The record you want isn't found because the newly added userid is not yet reflected in the index.
- The record you want is found, and the get() will fetch it consistently.
- The record you want is found, but the userid has actually been removed and the index is out of date. The get() will retrieve the entity consistently, and the userid is not present.
Your code can then decide what course of action to take.
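The three outcomes can be illustrated with a toy simulation in plain Python. This is only a sketch under stated assumptions: the dictionaries entities and index are hypothetical stand-ins for the strongly consistent store and the eventually consistent index, and nothing here is the ndb API. In a real handler the get and the update would run inside a transaction.

```python
# `entities` plays the strongly consistent store; `index` plays the
# eventually consistent query index, which may lag behind the store.
entities = {
    "group1": {"users_in_group": ["alice"], "data": []},
    "group2": {"users_in_group": [], "data": []},         # "bob" was just removed
    "group3": {"users_in_group": ["carol"], "data": []},  # "carol" was just added
}
index = {"alice": "group1", "bob": "group2"}  # stale for bob and carol


def update_data(user_id, additional_data):
    key = index.get(user_id)      # stands in for the keys-only query
    if key is None:
        # Index has not caught up with a newly added user.
        return "not found"
    entity = entities[key]        # stands in for the consistent get by key
    if user_id not in entity["users_in_group"]:
        # Index still lists a user who has already been removed.
        return "stale hit"
    # Index and entity agree, so the update is applied.
    entity["data"].append(additional_data)
    return "updated"
```

Here update_data("alice", ...) applies the update, update_data("carol", ...) finds nothing, and update_data("bob", ...) detects the stale index entry and leaves the entity untouched.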
What is the use case for querying all UserGroupData entities that a particular user is a member of, and that would require updates?

Simple explanation of Google App Engine NDB Datastore

I'm creating a Google App Engine application (python) and I'm learning about the general framework. I've been looking at the tutorial and documentation for the NDB datastore, and I'm having some difficulty wrapping my head around the concepts. I have a large background with SQL databases and I've never worked with any other type of data storage system, so I'm thinking that's where I'm running into trouble.
My current understanding is this: The NDB datastore is a collection of entities (analogous to DB records) that have properties (analogous to DB fields/columns). Entities are created using a Model (analogous to a DB schema). Every entity has a key that is generated for it when it is stored. This is where I run into trouble because these keys do not seem to have an analogy to anything in SQL DB concepts. They seem similar to primary keys for tables, but those are more tightly bound to records, and in fact are fields themselves. These NDB keys are not properties of entities, but are considered separate objects from entities. If an entity is stored in the datastore, you can retrieve that entity using its key.
One of my big questions is where do you get the keys for this? Some of the documentation I saw showed examples in which keys were simply created. I don't understand this. It seemed that when entities are stored, the put() method returns a key that can be used later. So how can you just create keys and define ids if the original keys are generated by the datastore?
Another thing that I seem to be struggling with is the concept of ancestry with keys. You can define parent keys of whatever kind you want. Is there a predefined schema for this? For example, if I had a model subclass called 'Person', and I created a key of kind 'Person', can I use that key as a parent of any other type? Like if I wanted a 'Shoe' key to be a child of a 'Person' key, could I also then declare a 'Car' key to be a child of that same 'Person' key? Or will I be unable to after adding the 'Shoe' key?
I'd really just like a simple explanation of the NDB datastore and its API for someone coming from a primarily SQL background.

I think you're overcomplicating things in your mind. When you create an entity, you can either give it a named key that you've chosen yourself, or leave that out and let the datastore choose a numeric ID. Either way, when you call put, the datastore will return the key, which is stored in the form [<entity_kind>, <id_or_name>] (actually this also includes the application ID and any namespace, but I'll leave that out for clarity).
You can make entities members of an entity group by giving them an ancestor. That ancestor doesn't actually have to refer to an existing entity, although it usually does. All that happens with an ancestor is that the entity's key includes the key of the ancestor: so it now looks like [<parent_entity_kind>, <parent_id_or_name>, <entity_kind>, <id_or_name>]. You can now only get the entity by including its parent key. So, in your example, the Shoe entity could be a child of the Person, whether or not that Person has previously been created: it's the child that knows about the ancestor, not the other way round.
(Note that that ancestry path can be extended arbitrarily: the child entity can itself be an ancestor, and so on. In this case, the group is determined by the entity at the top of the tree.)
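As a rough illustration of the key structure described above, here is a toy model in plain Python that represents a key as a flat tuple of kind and id pairs. The helpers make_key, root_of and parent_of are hypothetical names invented for this sketch; they are not part of the ndb API.

```python
def make_key(*path, parent=None):
    """Build a key path as a flat tuple: (kind, id_or_name, kind, id_or_name, ...)."""
    if len(path) % 2 != 0:
        raise ValueError("path must alternate kind, id_or_name")
    return (parent or ()) + tuple(path)


def root_of(key):
    """The first (kind, id) pair identifies the entity group."""
    return key[:2]


def parent_of(key):
    """Drop the last (kind, id) pair; a root key has no parent."""
    return key[:-2] or None


# No Person entity has to be stored for these child keys to be valid.
person = make_key("Person", "alice")
shoe = make_key("Shoe", 1, parent=person)
car = make_key("Car", 7, parent=person)
lace = make_key("Lace", 3, parent=shoe)
```

The child key contains the full path of its ancestors, which is why the child knows its parent while the parent key carries no information about its children.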
Saving entities as part of a group has advantages in terms of consistency, in that a query inside an entity group is always guaranteed to be fully consistent, whereas a query outside one is only eventually consistent. However, there are also disadvantages, in that the write rate of an entity group is limited to 1 per second for the whole group.

Datastore keys are a little more analogous to internal SQL row identifiers, but of course not entirely. Identifiers in Appengine are a bit like SQL primary keys. To support decentralised concurrent creation of new keys by many application instances in a cloud of servers, AppEngine internally generates the keys to guarantee uniqueness. Your application defines parameters (application identifier, optional namespace, kind and optional entity identifier) which AppEngine uses to seed its key generator. If you do not provide an identifier, AppEngine will generate a unique numeric identifier that you can read.
Eventual consistency takes time so it is occasionally more efficient to request multiple new keys in bulk. AppEngine then generates a range of numeric entity identifiers for you. You can read their values from keys as KeyProperty metadata.
Ancestry is used to group together writes of related entities of all kinds for the purpose of transactions and isolation. There is no predefined schema for this but you are limited to one parent per child.
In your example, one particular Shoe might have a particular Person as parent. Another particular Shoe could have a Horse as parent. And another Shoe might have no parent. Many entities of all kinds can have the same parent, so several Car entities could also have that initial Person as parent. The Datastore is schemaless, so it's up to your application to allow or forbid a Car to have a Horse as parent.
Note that a child knows its parent, but a parent does not know its children, because implementing that would impact scalability.

Google app engine item / transaction data model

I am working on a GAE Python project which has items and transactions on items.
At first, we tried to use an item kind and a transaction kind with a reference property, but it complicated the queries a lot.
So we switched to an all-in-one version with transaction data stored directly in the item, which resulted in a lot of attributes not being used, as not all items are concerned by transactions.
I expected it to speed up the app, but is this the best way to do this?
Knowing that:
- We expect to have lots of transactions and many more items.
- We need to check the transaction status (currently stored in the item's status).
- There is only one transaction possible per item, but there are many different kinds of transactions.
Is there a better solution?
EDIT:
The problem is that I mostly query on the items using the transaction attributes, but with at most 2 where clauses at once. I also update the transaction frequently.
Currently I have something like this:
from google.appengine.ext import db
class MyItem(db.Model):
    owner = db.ReferenceProperty(MyUser)
    descr = db.StringProperty()
    status = db.IntegerProperty()  # contains item status / transaction status
    tx_actor = db.EmailProperty()
    tx_token = db.StringProperty()
    latest_tx_date = db.DateTimeProperty()

Since it's a 1-to-1 mapping, it boils down to how many of the attributes of either the item or transaction you need to query by - those attributes need to be indexed, and how often you write your items or transaction.
An example where it's fine to merge them:
- You rarely write your merged item/transaction object.
- You query on a small set of attributes, many of the attributes do not need to be indexed.
- When you do queries, you usually want both the item AND the transaction.
An example where it's a bad idea to merge them:
- Your item has many indexed attributes, but your transaction has very few. But you need to update your transaction frequently. In this case you'd be better off keeping them separate, because every time you write the transaction, you incur all the write costs of updating the indices for the item.
Another option, if you DON'T need to query on the transaction, is to store the transaction JSON-encoded in a single property; then you don't need to define all the attributes up front. You could also use the Expando class.
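A minimal sketch of the JSON option, using a plain dictionary in place of a datastore entity. The field name tx_json and the helpers set_transaction and get_transaction are hypothetical; in a real model the encoded string would presumably live in a single unindexed text property.

```python
import json

# Hypothetical item record: only the fields that are queried stay as
# top-level attributes; the transaction details live in one JSON string.
item = {
    "descr": "red bicycle",
    "status": 2,
    "tx_json": None,  # stays None for items without a transaction
}


def set_transaction(item, **tx_fields):
    """Serialize arbitrary transaction fields into the single tx_json field."""
    item["tx_json"] = json.dumps(tx_fields, sort_keys=True)


def get_transaction(item):
    """Decode the stored transaction, or return None if there is none."""
    return json.loads(item["tx_json"]) if item["tx_json"] else None


set_transaction(item, actor="buyer@example.com", token="abc123")
```

Because the transaction fields are opaque to the datastore in this layout, none of them can be used in a query filter, which is the trade-off mentioned above.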
To get better answers, you'd be better off posting examples of what your items/transactions look like, and the types of queries you'd want to run.

Discovering referers to SQLAlchemy object

I have a lot of model classes with relations between them and a CRUD interface to edit them. The problem is that some objects can't be deleted since there are other objects referring to them. Sometimes I can set up an ON DELETE rule to handle this case, but in most cases I don't want automatic deletion of related objects until they are unbound manually. Anyway, I'd like to present the editor with a list of objects referring to the currently viewed one and highlight those that prevent its deletion due to a FOREIGN KEY constraint. Is there a ready-made solution to automatically discover referers?
Update
The task seems to be quite common (e.g. the Django ORM shows all dependencies), so I'm surprised that there is no solution to it yet.
There are two directions suggested:
Enumerate all relations of the current object and go through their backref. But there is no guarantee that all relations have a backref defined. Moreover, there are some cases when a backref is meaningless. Although I can define it everywhere, I don't like doing it this way and it's not reliable.
(Suggested by van and stephan) Check all tables of the MetaData object and collect dependencies from their foreign_keys property (the code of sqlalchemy_schemadisplay can be used as an example, thanks to stephan's comments). This will catch all dependencies between tables, but what I need is dependencies between model classes. Some foreign keys are defined in intermediate tables and have no models corresponding to them (used as secondary in relations). Sure, I can go further and find the related model (I have yet to find a way to do it), but it looks too complicated.
Solution
Below is a method of the base model class (designed for the declarative extension) that I use as a solution. It is not perfect and doesn't meet all my requirements, but it works for the current state of my project. The result is collected as a dictionary of dictionaries, so I can show them grouped by objects and their properties. I haven't decided yet whether it's a good idea, since the list of referers is sometimes huge and I'm forced to limit it to some reasonable number.
def _get_referers(self):
    db = object_session(self)
    cls, ident = identity_key(instance=self)
    metadata = cls.__table__.metadata
    result = {}
    # _mapped_models is my extension. It is collected by metaclass, so I didn't
    # look for other ways to find all model classes.
    for other_class in metadata._mapped_models:
        queries = {}
        for prop in class_mapper(other_class).iterate_properties:
            # Only consider relation properties that point at this class.
            if not (isinstance(prop, PropertyLoader) and
                    issubclass(cls, prop.mapper.class_)):
                continue
            query = db.query(prop.parent)
            comp = prop.comparator
            if prop.uselist:
                query = query.filter(comp.contains(self))
            else:
                query = query.filter(comp == self)
            count = query.count()
            if count:
                queries[prop] = (count, query)
        if queries:
            result[other_class] = queries
    return result
Thanks to all who helped me, especially stephan and van.

SQL: I have to absolutely disagree with S.Lott's answer.
I am not aware of an out-of-the-box solution, but it is definitely possible to discover all the tables that have ForeignKey constraints to a given table. One needs to make proper use of the INFORMATION_SCHEMA views such as REFERENTIAL_CONSTRAINTS, KEY_COLUMN_USAGE, TABLE_CONSTRAINTS, etc. See the SQL Server example. With some limitations and extensions, most modern relational databases support the INFORMATION_SCHEMA standard. When you have all the FK information and the object (row) in the table, it is a matter of running a few SELECT statements to get all the rows in other tables that refer to the given row and prevent it from being deleted.
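The approach can be sketched with Python's sqlite3 module. SQLite does not provide INFORMATION_SCHEMA, so this sketch swaps in PRAGMA foreign_key_list to read the declared foreign keys; the tables author, book and review and the helper find_referrers are hypothetical, and the sketch assumes every table has an integer id primary key that the foreign keys point at.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT,
                       author_id INTEGER REFERENCES author(id));
    CREATE TABLE review (id INTEGER PRIMARY KEY, body TEXT,
                         reviewer_id INTEGER REFERENCES author(id));
    INSERT INTO author VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO book VALUES (10, 'First', 1), (11, 'Second', 1);
    INSERT INTO review VALUES (20, 'Nice', 2);
""")


def find_referrers(conn, target_table, target_id):
    """Return {(table, column): [row ids]} for rows that point at the target row."""
    result = {}
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    for table in tables:
        # Table names come from sqlite_master, not from user input.
        fks = conn.execute('PRAGMA foreign_key_list("%s")' % table).fetchall()
        for fk in fks:
            # Row layout: id, seq, referenced table, from column, to column, ...
            ref_table, from_col = fk[2], fk[3]
            if ref_table != target_table:
                continue
            ids = [r[0] for r in conn.execute(
                'SELECT id FROM "%s" WHERE "%s" = ? ORDER BY id'
                % (table, from_col), (target_id,))]
            if ids:
                result[(table, from_col)] = ids
    return result
```

For the sample data, find_referrers(conn, "author", 1) reports the two book rows that would block deleting that author. Only declared foreign keys are found this way; relationships that exist solely in application queries stay invisible.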
SQLAlchemy: As noted by stephan in his comment, if you use the ORM with backref for relations, then it should be quite easy to get the list of parent objects that keep a reference to the object you are trying to delete, because those objects are basically mapped properties of your object (child1.Parent).
If you work with SQLAlchemy Table objects (or do not always use backref for relations), then you would have to get the values of foreign_keys for all the tables, and then for each of those ForeignKeys call the references(...) method, providing your table as a parameter. In this way you will find all the FKs (and tables) that reference the table your object maps to. Then you can query all the objects that keep a reference to your object by constructing a query for each of those FKs.

In general, there's no way to "discover" all of the references in a relational database.
In some databases, they may use declarative referential integrity in the form of explicit Foreign Key or Check constraints.
But there's no requirement to do this. It can be incomplete or inconsistent.
Any query can include a FK relationship that is not declared. Without the universe of all queries, you can't know the relationships which are used but not declared.
To find "referers" in general, you must actually know the database design and have all queries.

For each model class, you can easily see if all its one-to-many relations are empty simply by asking for the list in each case and seeing how many entries it contains. (There is probably a more efficient way implemented in terms of COUNT, too.) If there are any foreign keys relating to the object, and you have your object relations set up correctly, then at least one of these lists will be non-zero in length.
