I am new to GAE and have started working with the NDB datastore service, but its parent key structure is really confusing me. I watched some tutorials on YouTube, but they just restate the documentation.
I also followed the documentation itself and it is still not clear to me. This is the link I explored.
Google App Engine NDB Data Store Service
NDB datastore is a distributed system. Absolute data consistency is very hard for distributed systems in general. By default NDB is eventually consistent. This means that by default:
If you add a record it may not appear immediately in a query
You cannot do transactions across multiple records by default
If you have stricter requirements you can define groups of entities by giving them the same parent key and specifying it in your queries. You then get consistent behaviour within these groups.
It is often better not to use parent keys at all, since they come with a heavy performance penalty; most of the time apps do not need them.
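For illustration, here is a minimal ndb sketch of giving entities a common parent key and querying within that group (the Account and Message names are invented for the example):

from google.appengine.ext import ndb

class Message(ndb.Model):
    text = ndb.StringProperty()

# every Message created with this parent belongs to the same entity group
account_key = ndb.Key('Account', 'alice@example.com')
Message(parent=account_key, text='hello').put()

# an ancestor query is strongly consistent within the group,
# so the message written above is guaranteed to be visible here
messages = Message.query(ancestor=account_key).fetch()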
Quote from Entities, Properties, and Keys
There is a write throughput limit of about one transaction per second within a single entity group. This limitation exists because Datastore performs masterless, synchronous replication of each entity group over a wide geographic area to provide high reliability and fault tolerance.
From my understanding, BigTable is a column-oriented NoSQL database. Although Google Cloud Datastore is built on top of Google's BigTable infrastructure, I have yet to see documentation that expressly says that Datastore itself is a column-oriented database. The fact that names reserved by the Python API are enforced in the API, but not in the Datastore itself, makes me question the extent to which Datastore mirrors the internal workings of BigTable. For example, validation features in the ndb.Model class are enforced in the application code but not in the datastore. An entity saved using the ndb.Model class can be retrieved somewhere else in the app that doesn't use the Model class, modified, have properties added, and then saved to the datastore without raising an error until it is loaded into a new instance of the Model class. With that said, is it safe to say Google Cloud Datastore is a column-oriented NoSQL database? If not, then what is it?
Strictly speaking, Google Cloud Datastore is a distributed, multi-dimensional sorted map. As you mentioned, it is based on Google BigTable; however, BigTable is only its foundation.
From a high-level point of view, Datastore actually consists of three layers.
BigTable
This is the necessary base for Datastore. It maps a row key, a column key and a timestamp (a three-dimensional mapping) to an array of bytes. Data is stored in lexicographic order by row key.
High scalability and availability
Strong consistency for single row
Eventual consistency for multi-row level
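Purely as a conceptual sketch (not a real API), the BigTable layer behaves like a sorted map from (row key, column key, timestamp) to bytes:

# conceptual model only: a sorted mapping (row_key, column_key, timestamp) -> bytes
table = {
    ('com.example/user1', 'profile:name', 1001): b'Alice',
    ('com.example/user1', 'profile:name', 1000): b'Al',    # older version
    ('com.example/user2', 'profile:name', 1002): b'Bob',
}
# rows are kept in lexicographic order by row key, so scanning a
# contiguous range of row keys is cheap
for key in sorted(table):
    print key, table[key]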
Megastore
This layer adds transactions on top of the BigTable.
Datastore
A layer above Megastore. It enables queries to run as index scans on BigTable. Here an index is not used for performance improvement but is required for queries to return results at all.
Furthermore, it optionally adds strong consistency at the multi-row level via ancestor queries. Such queries force the respective indexes to update before the actual scan is executed.
I'm creating a Google App Engine application (python) and I'm learning about the general framework. I've been looking at the tutorial and documentation for the NDB datastore, and I'm having some difficulty wrapping my head around the concepts. I have a large background with SQL databases and I've never worked with any other type of data storage system, so I'm thinking that's where I'm running into trouble.
My current understanding is this: The NDB datastore is a collection of entities (analogous to DB records) that have properties (analogous to DB fields/columns). Entities are created using a Model (analogous to a DB schema). Every entity has a key that is generated for it when it is stored. This is where I run into trouble because these keys do not seem to have an analogy to anything in SQL DB concepts. They seem similar to primary keys for tables, but those are more tightly bound to records, and in fact are fields themselves. These NDB keys are not properties of entities, but are considered separate objects from entities. If an entity is stored in the datastore, you can retrieve that entity using its key.
One of my big questions is where do you get the keys for this? Some of the documentation I saw showed examples in which keys were simply created. I don't understand this. It seemed that when entities are stored, the put() method returns a key that can be used later. So how can you just create keys and define ids if the original keys are generated by the datastore?
Another thing that I seem to be struggling with is the concept of ancestry with keys. You can define parent keys of whatever kind you want. Is there a predefined schema for this? For example, if I had a model subclass called 'Person', and I created a key of kind 'Person', can I use that key as a parent of any other type? Like if I wanted a 'Shoe' key to be a child of a 'Person' key, could I also then declare a 'Car' key to be a child of that same 'Person' key? Or will I be unable to after adding the 'Shoe' key?
I'd really just like a simple explanation of the NDB datastore and its API for someone coming from a primarily SQL background.
I think you're overcomplicating things in your mind. When you create an entity, you can either give it a named key that you've chosen yourself, or leave that out and let the datastore choose a numeric ID. Either way, when you call put(), the datastore will return the key, which is stored in the form [<entity_kind>, <id_or_name>] (actually this also includes the application ID and any namespace, but I'll leave those out for clarity).
You can make entities members of an entity group by giving them an ancestor. That ancestor doesn't actually have to refer to an existing entity, although it usually does. All that happens with an ancestor is that the entity's key includes the key of the ancestor: so it now looks like [<parent_entity_kind>, <parent_id_or_name>, <entity_kind>, <id_or_name>]. You can now only get the entity by including its parent key. So, in your example, the Shoe entity could be a child of the Person, whether or not that Person has previously been created: it's the child that knows about the ancestor, not the other way round.
(Note that that ancestry path can be extended arbitrarily: the child entity can itself be an ancestor, and so on. In this case, the group is determined by the entity at the top of the tree.)
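As a sketch of what those key paths look like in ndb (the kinds and ids are made up for the example):

from google.appengine.ext import ndb

# root entity key: [<entity_kind>, <id_or_name>]
person_key = ndb.Key('Person', 'bob')

# child entity key: the parent's pair comes first, i.e.
# [<parent_entity_kind>, <parent_id_or_name>, <entity_kind>, <id_or_name>]
shoe_key = ndb.Key('Person', 'bob', 'Shoe', 42)

# equivalently, pass the parent key when building the child's key
same_shoe_key = ndb.Key('Shoe', 42, parent=person_key)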
Saving entities as part of a group has advantages in terms of consistency, in that a query within an entity group is always guaranteed to be fully consistent, whereas outside an entity group a query is only eventually consistent. However, there are also disadvantages, in that the write rate of an entity group is limited to about 1 write per second for the whole group.
Datastore keys are somewhat analogous to internal SQL row identifiers, though of course not entirely. Identifiers in App Engine are a bit like SQL primary keys. To support decentralised, concurrent creation of new keys by many application instances in a cloud of servers, App Engine internally generates the keys to guarantee uniqueness. Your application defines parameters (application identifier, optional namespace, kind and optional entity identifier) which App Engine uses to seed its key generator. If you do not provide an identifier, App Engine will generate a unique numeric identifier that you can read.
Eventual consistency takes time so it is occasionally more efficient to request multiple new keys in bulk. AppEngine then generates a range of numeric entity identifiers for you. You can read their values from keys as KeyProperty metadata.
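If you do want a batch of identifiers up front, ndb can reserve a numeric range for a kind. A hedged sketch (the Ticket model is invented, and I am assuming allocate_ids returns the first and last reserved IDs):

from google.appengine.ext import ndb

class Ticket(ndb.Model):
    pass

# reserve 100 numeric IDs for the Ticket kind; the datastore will not
# hand these out automatically to other Ticket entities later
first, last = Ticket.allocate_ids(100)
reserved_keys = [ndb.Key('Ticket', i) for i in range(first, last + 1)]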
Ancestry is used to group together writes of related entities of all kinds for the purpose of transactions and isolation. There is no predefined schema for this but you are limited to one parent per child.
In your example, one particular Shoe might have a particular Person as parent. Another particular Shoe could have a Horse as parent. And another Shoe might have no parent. Many entities of all kinds can have the same parent, so several Car entities could also have that initial Person as parent. The Datastore is schemaless, so it's up to your application to allow or forbid a Car to have a Horse as parent.
Note that a child knows its parent, but a parent does not know its children, because implementing that would impact scalability.
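To make that concrete, here is a short ndb sketch (model names invented) where different kinds share the same Person parent and one entity has no parent at all:

from google.appengine.ext import ndb

class Shoe(ndb.Model):
    size = ndb.IntegerProperty()

class Car(ndb.Model):
    plate = ndb.StringProperty()

owner = ndb.Key('Person', 'alice')      # the parent entity need not even exist
shoe = Shoe(parent=owner, size=38)      # a Shoe child of that Person
car = Car(parent=owner, plate='X123')   # a Car child of the same Person
loose_shoe = Shoe(size=44)              # a Shoe with no parent at all
ndb.put_multi([shoe, car, loose_shoe])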
Edit: Main Topic:
Is there a way I can force SqlAlchemy to pre-populate the session as far as it can? Synchronize as much state from the database as possible (there will be no DB updates at this point).
I am having some mild performance issues and I believe I have traced it to SqlAlchemy. I'm sure there are changes in my declarative and db-schema that could improve time, but that is not what I am asking about here.
My SqlAlchemy declarative defines 8 classes, my database has 11 tables with only 7 of them holding my real data, and in total my database has 800 records (all Integers and UnicodeText). My database engine is sqlite and the actual size is currently 242Kb.
Really, the number of entities is quite small, but many of the table relationships have recursive behavior (5-6 levels deep). My problem starts with the wonderful automagic that SA does for me, and my reluctance to properly extract the data with my own python classes.
I have ORM attribute access scattered across all kinds of iterators, recursive evaluators, right up to my file I/O streams. Access to these attributes is largely non-linear, and every time I do a lookup my call stack disappears into SqlAlchemy for quite some time, and I am getting lots of singleton queries.
I am using mostly default SA settings (python 2.7.2, sqlalchemy 0.7).
Considering that RAM is not an issue, and that my database is so small (for the time being), is there a way I can just force SqlAlchemy to pre-populate the session as far as it can? I am hoping that if I just load the raw data into memory, then the most I will have to do is chase a few joins dynamically (almost all queries are pretty straightforward).
I am hoping for a 5 minute fix so I can run some reports ASAP. My next month of TODO is likely going to be full of direct table queries and tighter business logic that can pipeline tuples.
A five minute fix for that kind of issue is unlikely, but for many-to-one "singleton" gets there is a simple recipe I use often. Suppose you're loading lots of User objects and they all have many-to-one references to a Category of some kind:
# load all categories, then hold onto them
categories = Session.query(Category).all()
for user in Session.query(User):
    print user, user.category  # no SQL will be emitted for the Category
This is because the query.get() that a many-to-one lazy load emits will look in the local identity map for the primary key first.
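For context, the recipe above assumes a mapping roughly like the following minimal sketch (class and table names are illustrative, not taken from the question):

from sqlalchemy import Column, Integer, ForeignKey, create_engine
from sqlalchemy.orm import relationship, sessionmaker
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Category(Base):
    __tablename__ = 'category'
    id = Column(Integer, primary_key=True)

class User(Base):
    __tablename__ = 'user'
    id = Column(Integer, primary_key=True)
    category_id = Column(Integer, ForeignKey('category.id'))
    # many-to-one: the lazy load does a primary-key lookup on Category,
    # which checks the session's identity map before emitting any SQL
    category = relationship(Category)

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)()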
If you're looking for more caching than that (and have a bit more than five minutes to spare), the same concept can be expanded to also cache the results of SELECT statements in a way that the cache is associated only with the current Session - check out the local_session_caching.py recipe included with the distribution examples.
Okay, I have watched the video and read the articles in the App Engine documentation (including Using the High Replication Datastore). However I am still completely confused on the practical usage of it. I understand the benefits (from the video) and they sound great. But what I am lacking is a few practical examples. There are plenty of master/slave examples on the web, but very little illustrating (with proper documentation) the high replication datastore. The guestbook code example used in the Using the High Replication Datastore article illustrates the ancestor key by adding a new functionality that the previous guestbook code example does not have (seems you can change guestbook). This just adds to the confusion.
I often use djangoforms on GAE and I was wondering if someone can help me translate all these queries into high replication datastore compatible queries (let's forget for a moment the discussion that not all queries necessarily need to be high replication datastore compatible queries and focus on the example itself).
UPDATE: by "high replication datastore compatible queries" I mean queries that always return the latest data and not potentially stale data. Using entity groups seems to be the way to go here, but as mentioned before, I don't have many practical code examples of how to do this, so that is what I am looking for!
So the queries in this article are:
The main recurring query in this article is:
query = db.GqlQuery("SELECT * FROM Item ORDER BY name")
which we will translate to:
query = Item.all().order('name')  # datastore request
validating the form happens like:
data = ItemForm(data=self.request.POST)
if data.is_valid():
    # Save the data, and redirect to the view page
    entity = data.save(commit=False)
    entity.added_by = users.get_current_user()
    entity.put()  # datastore request
and getting the latest entry from the datastore for populating a form happens like:
id = int(self.request.get('id'))
item = Item.get(db.Key.from_path('Item', id))  # datastore request
data = ItemForm(data=self.request.POST, instance=item)
So what do I/we need to do to make all these datastore requests compatible with the high replication datastore?
One last thing that is also not clear to me: does using ancestor keys have any impact on the model in the datastore? For example, in the guestbook code example they use:
def guestbook_key(guestbook_name=None):
    return db.Key.from_path('Guestbook', guestbook_name or 'default_guestbook')
However, 'Guestbook' does not exist in the model, so how can you use db.Key.from_path on this, and why does it work? Does this change how the data is stored in the datastore in a way I need to take into account when retrieving it (e.g. does it add another field I should exclude from showing when using djangoforms)?
Like I said before, this is confusing me a lot and your help is greatly appreciated!
I'm not sure why you think you need to change your queries at all. The documentation that you link to clearly states:
The back end changes, but the datastore API does not change at all. You'll use the same programming interfaces no matter which datastore you're using.
The point of that page is just to say that queries may be out of sync if you don't use entity groups. Your final code snippet is just an example of that - the key built from the string 'Guestbook' is exactly an ancestor key. I don't understand why you think it needs to exist in the model. Once again, this is unchanged from the non-HR datastore - it has always been the case that keys are built up from paths, which can consist of arbitrary strings. You probably need to reread the documentation on entity groups and keys.
The changes to use the HRD are not in how queries are made, but in what guarantees are made about what data you get back. The example you give:
query = db.GqlQuery("SELECT * FROM Item ORDER BY name")
will work in the HRD as well. The catch (basically) is that this kind of query (using either this syntax, or the Item.all() form) can return objects slightly out-of-date. This is probably not a big deal with the guestbook.
Note that if you're getting an object by key directly, it will never be out-of-date. It's only for queries that you can see this issue. You can avoid this problem with queries by placing all the entities that need to be consistent in a single entity group. Note that this limits the rate at which you can write to the entity group.
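For example, here is a hedged sketch of that approach with the question's Item model (the ItemList parent kind and key name are invented): put every Item under a common parent key, and use an ancestor query when you need strongly consistent results:

from google.appengine.ext import db
from google.appengine.api import users

class Item(db.Model):
    name = db.StringProperty()
    added_by = db.UserProperty()

def item_list_key(list_name='default_list'):
    # parent key shared by every Item in one list (one entity group)
    return db.Key.from_path('ItemList', list_name)

# create the item inside the entity group
item = Item(parent=item_list_key(), name='example',
            added_by=users.get_current_user())
item.put()

# an ancestor query against that group is strongly consistent in the HRD
query = Item.all().ancestor(item_list_key()).order('name')
# equivalently in GQL:
# query = db.GqlQuery("SELECT * FROM Item WHERE ANCESTOR IS :1 ORDER BY name",
#                     item_list_key())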
In answer to your follow-up question, 'Guestbook' is the kind of the ancestor key (and the guestbook name is its key name); no Guestbook model class has to exist for such a key to be valid.
I want to be able to give each instance of a particular object a unique numeric ID in the order in which they are created, so I was thinking of counting the number of entities of that kind already in the datastore and adding 1 to get the new number.
I know I can do something like
query = Object.all()
count = query.count()
but that has some limitations.
Does anybody know a better way to find the number of particular entities or even a better way to give objects a unique sequential number id?
Thanks!
Why do your IDs need to be sequential? The App Engine datastore generates integer IDs for your entities already; they're not guaranteed to be sequential, but they are guaranteed to be unique, and they tend to be small.
The ID generation strategy for App Engine is not 'perfect' (that is, entirely sequential), because doing so in a distributed system is impractical: it introduces a single bottleneck, the service that hands out IDs. Any system you build will suffer from the same issue, unless you want only a low rate of ID issuance (e.g. 1 per second or less).
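For instance, the auto-generated numeric ID is readable right after put(); a small sketch with a stand-in model:

from google.appengine.ext import db

class Thing(db.Model):   # stand-in for your own model class
    pass

thing = Thing()
thing.put()
print thing.key().id()   # unique, but not necessarily sequential, integer ID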
Usual answer to many questions on non-relational DBs: denormalize wisely. Keep a model whose sole purpose is counting, with a single entity; in the factory function that builds entities of the other model, transactionally increment that counter and use the resulting value in the new entity. If this proves to be a bottleneck, consider sharded counters or other parallelization techniques for counters -- again, just as usual for App Engine (the fact that you're using the counter as a UID does not really affect this choice).
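A hedged sketch of that counter-entity approach using the db API (the Counter model, key name and function are invented for illustration):

from google.appengine.ext import db

class Counter(db.Model):
    # single entity whose only job is to hand out the next sequence number
    next_id = db.IntegerProperty(default=1)

def next_sequential_id():
    def txn():
        counter = Counter.get_by_key_name('object_counter')
        if counter is None:
            counter = Counter(key_name='object_counter')
        value = counter.next_id
        counter.next_id = value + 1
        counter.put()
        return value
    # the read-increment-write runs atomically inside a transaction
    return db.run_in_transaction(txn)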
For Python GAE SDK, you can increase the argument "limit" of the count method:
https://developers.google.com/appengine/docs/python/datastore/queryclass#Query_count
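For example, using the Object query from the question (the exact limit value is up to you):

# lift the default cap of 1000 results by passing an explicit limit
count = Object.all().count(limit=100000)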
Replicated from Google AppEngine: how to count a database's entries beyond 1000?