I'm creating a Google App Engine application (python) and I'm learning about the general framework. I've been looking at the tutorial and documentation for the NDB datastore, and I'm having some difficulty wrapping my head around the concepts. I have a large background with SQL databases and I've never worked with any other type of data storage system, so I'm thinking that's where I'm running into trouble.
My current understanding is this: The NDB datastore is a collection of entities (analogous to DB records) that have properties (analogous to DB fields/columns). Entities are created using a Model (analogous to a DB schema). Every entity has a key that is generated for it when it is stored. This is where I run into trouble because these keys do not seem to have an analogy to anything in SQL DB concepts. They seem similar to primary keys for tables, but those are more tightly bound to records, and in fact are fields themselves. These NDB keys are not properties of entities, but are considered separate objects from entities. If an entity is stored in the datastore, you can retrieve that entity using its key.
One of my big questions is where do you get the keys for this? Some of the documentation I saw showed examples in which keys were simply created. I don't understand this. It seemed that when entities are stored, the put() method returns a key that can be used later. So how can you just create keys and define ids if the original keys are generated by the datastore?
Another thing that I seem to be struggling with is the concept of ancestry with keys. You can define parent keys of whatever kind you want. Is there a predefined schema for this? For example, if I had a model subclass called 'Person', and I created a key of kind 'Person', can I use that key as a parent of any other type? Like if I wanted a 'Shoe' key to be a child of a 'Person' key, could I also then declare a 'Car' key to be a child of that same 'Person' key? Or will I be unable to after adding the 'Shoe' key?
I'd really just like a simple explanation of the NDB datastore and its API for someone coming from a primarily SQL background.
I think you're overcomplicating things in your mind. When you create an entity, you can either give it a named key that you've chosen yourself, or leave that out and let the datastore choose a numeric ID. Either way, when you call put, the datastore will return the key, which is stored in the form [<entity_kind>, <id_or_name>] (actually this also includes the application ID and any namespace, but I'll leave that out for clarity).
You can make entities members of an entity group by giving them an ancestor. That ancestor doesn't actually have to refer to an existing entity, although it usually does. All that happens with an ancestor is that the entity's key includes the key of the ancestor: so it now looks like [<parent_entity_kind>, <parent_id_or_name>, <entity_kind>, <id_or_name>]. You can now only get the entity by including its parent key. So, in your example, the Shoe entity could be a child of the Person, whether or not that Person has previously been created: it's the child that knows about the ancestor, not the other way round.
(Note that that ancestry path can be extended arbitrarily: the child entity can itself be an ancestor, and so on. In this case, the group is determined by the entity at the top of the tree.)
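To make the key layout described above concrete, here is a minimal sketch in plain Python that models a key as a tuple of (kind, id_or_name) pairs. It is a toy model and not the NDB API: `make_key`, `parent_of`, and `root_of` are hypothetical helpers, and the application ID and namespace are left out, as in the text.

```python
# Toy model of datastore key paths; this is NOT the NDB API.
# A key is a tuple of (kind, id_or_name) pairs, ancestors first.

def make_key(kind, id_or_name, parent=()):
    """Return a key whose path is the parent's path plus one more pair."""
    return tuple(parent) + ((kind, id_or_name),)

def parent_of(key):
    """Return the parent key, or None for a root key."""
    return key[:-1] or None

def root_of(key):
    """Return the key at the top of the tree, which identifies the entity group."""
    return key[:1]

person = make_key("Person", "alice")
shoe = make_key("Shoe", 1, parent=person)
car = make_key("Car", 7, parent=person)   # same parent, different kind
lace = make_key("Lace", 2, parent=shoe)   # the path can keep growing
```

In this model the Shoe and the Car share one root, so they would sit in the same entity group, and nothing requires the Person to have been stored first.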
Saving entities as part of a group has advantages in terms of consistency, in that a query inside an entity group is always guaranteed to be fully consistent, whereas a query outside one is only eventually consistent. However, there are also disadvantages, in that the write rate of an entity group is limited to about 1 per second for the whole group.
Datastore keys are a little more analogous to internal SQL row identifiers, but of course not entirely. Identifiers in Appengine are a bit like SQL primary keys. To support decentralised concurrent creation of new keys by many application instances in a cloud of servers, AppEngine internally generates the keys to guarantee uniqueness. Your application defines parameters (application identifier, optional namespace, kind and optional entity identifier) which AppEngine uses to seed its key generator. If you do not provide an identifier, AppEngine will generate a unique numeric identifier that you can read.
Eventual consistency takes time so it is occasionally more efficient to request multiple new keys in bulk. AppEngine then generates a range of numeric entity identifiers for you. You can read their values from keys as KeyProperty metadata.
Ancestry is used to group together writes of related entities of all kinds for the purpose of transactions and isolation. There is no predefined schema for this but you are limited to one parent per child.
In your example, one particular Shoe might have a particular Person as parent. Another particular Shoe could have a Horse as parent. And another Shoe might have no parent. Many entities of all kinds can have the same parent, so several Car entities could also have that initial Person as parent. The Datastore is schemaless, so it's up to your application to allow or forbid a Car to have a Horse as parent.
Note that a child knows its parent, but a parent does not know its children, because implementing that would impact scalability.
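Since the parent does not record its children, finding them means searching for keys whose path starts with the parent's path. The following is a rough sketch of that idea in plain Python, again using tuples of (kind, id_or_name) pairs as a stand-in for real keys. The `descendants` helper and the sample data are hypothetical, and unlike a real ancestor query this toy never returns the ancestor itself.

```python
# Toy model: keys are tuples of (kind, id_or_name) pairs, ancestors first.
stored_keys = [
    (("Person", "alice"),),
    (("Person", "alice"), ("Shoe", 1)),
    (("Person", "alice"), ("Car", 7)),
    (("Horse", "ed"), ("Shoe", 2)),
    (("Shoe", 3),),  # a Shoe with no parent
]

def descendants(ancestor, keys, kind=None):
    """Return keys that have `ancestor` as a proper prefix, optionally of one kind."""
    n = len(ancestor)
    return [
        k for k in keys
        if len(k) > n and k[:n] == ancestor
        and (kind is None or k[-1][0] == kind)
    ]

alice = (("Person", "alice"),)
alice_things = descendants(alice, stored_keys)
alice_cars = descendants(alice, stored_keys, kind="Car")
```

Nothing in the data forbids a Shoe under a Horse; any such rule would have to live in application code.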
I'm new to Google Cloud Datastore and am reading the documentation.
(Note: we don't plan to use Google AppEngine, just DataStore only.)
According to the documentation, Datastore supports transactions, but:
If you want to use queries within a transaction,
your data must be organized into entity groups in such a way
that you can specify ancestor filters that will match the right data.
So I thought that whenever I want to use a transaction, I am forced to create a parent key and set it as an ancestor, and that all entities under that parent are then limited to one update or transaction per second.
However, I also see a very simple example of insert here:
https://cloud.google.com/datastore/docs/concepts/entities#datastore-insert-python
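from google.cloud import datastore

client = datastore.Client()

with client.transaction():
    # No parent is given, so the new Task is a root entity.
    incomplete_key = client.key('Task')
    task = datastore.Entity(key=incomplete_key)
    task.update({
        'category': 'Personal',
        'done': False,
        'priority': 4,
        'description': 'Learn Cloud Datastore'
    })
    client.put(task)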
It doesn't specify a parent and uses a single root entity inside a transaction, does it? Even among the examples on the Transactions page, only the one for a "read-only transaction" explicitly specifies a parent. Do the other ones simply omit a parent while it actually exists?
I'm wondering whether I can use a transaction without an entity group (that is, without a big performance degradation) if I can specify the key of a root entity, but there is no such description in the documentation...
I'd appreciate it if someone could clarify the behavior. Thanks.
Transactions across multiple entity groups are indeed allowed (with a limit of 25 entity groups per transaction, according to the documentation).
If you want to use queries within a transaction,
Note this key sentence in the text you quoted. It is saying that any queries you want to issue inside a transaction need to be ancestor queries. This is because non-ancestor queries are eventually consistent, so it would be impossible for the transaction engine to reason about any state changes, and hence it could not know when to fail or succeed the transaction. It is not saying you cannot do transactions across entity groups.
It doesn't specify a parent and use a single root entity inside a
transaction, does it ?
I think this is the other source of confusion. Only child entities have parents specified to denote which entity group they are in. When no parent is specified, the entity in question is a root entity (its parent is root). Another way of saying this is that every root entity is its own entity group.
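The point that every root entity forms its own entity group, together with the 25-group limit per transaction mentioned above, can be illustrated with a small sketch in plain Python. Keys are modelled as tuples of (kind, id) pairs; `entity_groups` and `fits_in_one_transaction` are hypothetical helpers, and the limit is taken from this answer rather than checked against the service.

```python
# Toy model; not the Cloud Datastore client API.
MAX_GROUPS_PER_TRANSACTION = 25  # figure quoted in the answer; treat as an assumption

def entity_groups(keys):
    """Return the set of root keys (one per entity group) touched by `keys`."""
    return {key[:1] for key in keys}

def fits_in_one_transaction(keys, limit=MAX_GROUPS_PER_TRANSACTION):
    """True if the keys span no more entity groups than the assumed limit."""
    return len(entity_groups(keys)) <= limit

# 30 parentless Task entities: each one is its own entity group.
root_tasks = [(("Task", i),) for i in range(30)]
# 30 Task entities under one hypothetical TaskList parent: a single entity group.
grouped_tasks = [(("TaskList", "default"), ("Task", i)) for i in range(30)]
```

The trade-off is visible here: root entities avoid sharing one group's write limit, but a single transaction can only span so many of them.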
Technically the task entity in your description constitutes an entity group even though it has no child entities. A single transaction can touch at most 25 entity groups, so if you try to write more than 25 top-level entities in one transaction using this pattern, the transaction will fail.
The way I avoid performance hits is to use multiple entity groups. I structure my datastore so that I have multiple root entities and try to limit multiple transactions within an entity group.
I am new to GAE. I started working with the NDB datastore service, but its parent key structure is really confusing me. I also watched some tutorials on YouTube, but they just explain the documentation.
I also followed the documentation, but it is still not clear to me. This is the link I explored:
Google App Engine NDB Data Store Service
NDB datastore is a distributed system. Absolute data consistency is very hard for distributed systems in general. By default NDB is eventually consistent. This means that by default:
If you add a record it may not appear immediately in a query
You cannot do transactions across multiple records by default
If you have more strict requirements you can define groups of entities by giving them the same parent key and specifying it in queries. You are then able to get consistent behaviour within these groups.
It is often better not to use parent keys at all, since they come with a heavy performance penalty. Most of the time apps do not need parent keys.
Quote from Entities, Properties, and Keys
There is a write throughput limit of about one transaction per second within a single entity group. This limitation exists because Datastore performs masterless, synchronous replication of each entity group over a wide geographic area to provide high reliability and fault tolerance.
I am working on a web application based on Google App Engine (Python / Webapp2) and Google NDB Datastore.
I assumed that if I tried to add a new entity using, as its parent key, the key of an entity that no longer exists, an exception would be thrown. Instead, I have found that the entity is actually created.
Am I doing something wrong?
I could check beforehand whether the parent still exists through a keys_only query. Does that consume GAE read quotas?
You can create a key for any entity whether this entity exists or not. This is because a key is simply an encoding of an entity kind and either an id or name (and ancestor keys, if any).
This means that you can store a child entity before a parent entity is saved, as long as you know the parent's id or name. You cannot reassign a child from one parent to another, though.
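As a rough illustration of why no exception is raised, here is a toy in-memory store in plain Python in which a key is just a tuple of (kind, id_or_name) pairs. It is not the NDB API; `put`, `get`, and `store` are hypothetical stand-ins that only mimic the behaviour described above.

```python
# Toy in-memory store keyed by full key path; this is NOT the NDB API.
store = {}

def put(key, properties):
    """Store an entity under `key`. No check is made that ancestors exist."""
    store[key] = dict(properties)
    return key

def get(key):
    """Return the stored properties for `key`, or None if nothing is stored there."""
    return store.get(key)

parent_key = (("Person", "bob"),)        # built by hand, never stored
child_key = parent_key + (("Shoe", 1),)  # the child's key embeds the parent's path
put(child_key, {"size": 42})
```

Because the parent path is part of the child's key, giving the child a different parent would produce a different key, which is why a child cannot simply be reassigned.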
As for your second question, the AppEngine pricing page says:
Calls to the datastore API result in the following billable operations. Small datastore operations include calls to allocate datastore ids or keys-only queries. These operations are free.
To complement @andrei's answer to the first question: no key reference in NDB is checked for referring to an existing entity. This is true for keys used as a parent, as well as for keys stored in a KeyProperty within an entity.
Say I have RootEntity, AEntity(child of RootEntity), BEntity(child of AEntity).
from google.appengine.ext import ndb

class RootEntity(ndb.Model):
    rtp = ndb.StringProperty()

class AEntity(ndb.Model):
    ap = ndb.IntegerProperty()

class BEntity(ndb.Model):
    bp = ndb.StringProperty()
So in different handlers I need to get instances of BEntity with specific ancestor(instance of AEntity).
Here is my query: BEntity.query(ancestor = ndb.Key("RootEntity", 1, "AEntity", AEntity.query(ancestor = ndb.Key("RootEntity", 1)).filter(AEntity.ap == int(some_value)).get().key.integer_id()))
How can I optimize this query? Can I make it better, or maybe less complicated?
Upd:
This query is part of a function with the @ndb.transactional decorator.
You should not use Entity Groups to represent entity relationships.
Entity groups have a special purpose: to define the scope of transactions. They give you ability to update multiple entities transactionally, as long as they are a part of the same entity group (this limitation has been somewhat relaxed with the new XG transactions). They also allow you to use queries within transactions (not available via XG transactions).
The downside of entity groups is that they have an update limitation of 1 write/second.
In your case my suggestion would be to use separate entities and make references between them. The reference should be a Key of the referenced entity as this is type-safe.
Regarding query simplicity: GAE unfortunately does not support JOINs or reference (multi-entity) queries, so you would still need to combine multiple queries together (as you do now).
There is a give and take with ancestor queries. They are more verbose and messy to deal with, but you get a better structure to your data and consistency in your queries.
To simplify this, if your handler knows the BEntity you want to get, just pass around the key.urlsafe() encoded key, it already has all of your ancestor information encoded.
If this is not possible, try possibly restructuring your data. Since these objects are all of the same ancestor, they belong to the same entity group, thus at most you can insert/update ~1 time per second for objects in that entity group. If you require higher throughput or do not require consistent ancestral queries, then try using ndb.KeyProperty to link entities with a reference to a parent rather than as an ancestor. Then you'd only need to get a single parent to query on rather than the parent and the parent's parent.
You should also try and use IDs whenever possible, so you can avoid having to filter for entities in your datastore by properties and just reference them by ID:
BEntity.query(ancestor = ndb.Key("RootEntity", 1, "AEntity", int(some_value)))
Here, int(some_value) is the integer ID of the AEntity you used when you created that object. Just be sure that you can ensure the IDs you manually create/use will be unique across all instances of that Model that share the same parent.
EDIT:
To clarify, my last example should have made it clearer that I was suggesting restructuring the data so that int(some_value) is used as the integer ID of the AEntity rather than being stored as a separate property of the entity, if possible of course. In the example given, a query is performed for the AEntity objects that have a given integer field value of int(some_value) and is executed with a get(), implying that you always expect a single result for that value. That makes it a good candidate for the integer ID in the key of that object, eliminating the need for a query.
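The restructuring suggested here can be sketched in plain Python with a dictionary standing in for the datastore. This is a toy model and not NDB code: `find_a_key_by_scan` stands in for the property query, `a_key_from_id` for building the key directly, and the ids 5001 and 5002 are made-up auto-assigned values.

```python
# Toy store: key path -> properties. This is NOT the NDB API.
ROOT = (("RootEntity", 1),)

# Before: each AEntity has an auto-assigned id and keeps the value in property `ap`.
by_property = {
    ROOT + (("AEntity", 5001),): {"ap": 10},
    ROOT + (("AEntity", 5002),): {"ap": 20},
}

def find_a_key_by_scan(store, value):
    """Search for the AEntity whose `ap` equals `value` (stands in for a query)."""
    for key, props in store.items():
        if key[-1][0] == "AEntity" and props.get("ap") == value:
            return key
    return None

def a_key_from_id(value):
    """After restructuring: the value is the id, so the key is built directly."""
    return ROOT + (("AEntity", int(value)),)
```

With the second layout, the ancestor key for the BEntity query is known without reading anything first, provided the chosen ids really are unique under that parent.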
It is well-known that the AppEngine datastore is built on top of Bigtable, which is intrinsically sorted by key. It is also known (somewhat) how the keys are generated by the AppEngine datastore: by "combining" the app id, entity kind, instance path, and unique instance identifier (perhaps through concatenation, see here).
What is not clear is whether a transformation is done on that unique instance identifier before it is stored such as would make sequential keys non-sequential in storage (e.g. if I specify key_name="Test", is "Test" just concatenated at the end of the key without transformation?) Of course it makes sense to preserve the app-ids, entity kinds, and paths as-is to take advantage of locality/key-sorting in the Bigtable (Google's other primary storage technology, F1, works similarly with hierarchical keys), but I don't know about the unique instance identifier.
Can I rely upon key_names being preserved as-is in AppEngine's datastore?
Keys are composed using a special Protocol Buffer serialization that preserves the natural order of the fields it encodes. That means that yes, two entities with the same kind and parent will have their keys ordered by key name.
Note, though, that the sort order has entity type and parent key first, so two entities of different types, or of the same type but with different parent entities, will not appear sequentially even if their keys are sequential.
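A small sketch can show the effect of that ordering. Here Python's built-in tuple comparison stands in for the order-preserving key encoding; that substitution is an assumption made for illustration, not how the datastore serializes keys, and the sample uses only string names so the tuples stay comparable.

```python
# Toy model: tuple ordering stands in for the order-preserving key encoding.
keys = [
    (("Person", "alice"), ("Shoe", "b")),
    (("Person", "bob"), ("Shoe", "a")),
    (("Person", "alice"), ("Car", "z")),
    (("Person", "alice"), ("Shoe", "a")),
]

# Sorting compares the parent pair first, then the kind, then the key name.
ordered = sorted(keys)

# Position of each Shoe that belongs to alice within the sorted list.
alice_shoes = [i for i, k in enumerate(ordered)
               if k[0] == ("Person", "alice") and k[1][0] == "Shoe"]
```

The two Shoe keys under the same parent end up adjacent and in name order, while the Shoe named "a" under a different parent is sorted elsewhere.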
In addition to what @Nick explained:
If you use auto-generated numeric IDs, the legacy system used to be semi-increasing (IDs were assigned in increasing blocks), but with the new system they are pretty scattered.