It is well-known that the AppEngine datastore is built on top of Bigtable, which is intrinsically sorted by key. It is also known (somewhat) how the keys are generated by the AppEngine datastore: by "combining" the app id, entity kind, instance path, and unique instance identifier (perhaps through concatenation, see here).
What is not clear is whether a transformation is done on that unique instance identifier before it is stored, one that would make sequential key names non-sequential in storage. For example, if I specify key_name="Test", is "Test" just concatenated onto the end of the key without transformation? Of course it makes sense to preserve the app ids, entity kinds, and paths as-is to take advantage of locality/key-sorting in Bigtable (Google's other primary storage technology, F1, works similarly with hierarchical keys), but I don't know about the unique instance identifier.
Can I rely upon key_names being preserved as-is in AppEngine's datastore?
Keys are composed using a special Protocol Buffer serialization that preserves the natural order of the fields it encodes. That means that yes, two entities with the same kind and parent will have their keys ordered by key name.
Note, though, that the sort order has entity kind and parent key first, so two entities of different kinds, or of the same kind but with different parent entities, will not appear sequentially even if their key names are sequential.
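For illustration, here is a minimal ndb sketch of exploiting that ordering (the Event kind and key names are invented; ndb exposes the key for filtering as _key):

from google.appengine.ext import ndb

class Event(ndb.Model):  # hypothetical kind
    payload = ndb.StringProperty()

# Same kind and same (absent) parent: keys sort by key name, so a
# key-range filter behaves like a lexicographic range scan in Bigtable.
start = ndb.Key(Event, 'Test-000')
end = ndb.Key(Event, 'Test-999')
events = Event.query(Event._key >= start, Event._key <= end).fetch()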
In addition to what @Nick explained:
If you use auto-generated numeric IDs, the legacy system used to be semi-increasing (IDs were assigned in increasing blocks), but with the new system they are pretty scattered.
Related
I have a requirement where I would like to store two or more independent sets of key-value pairs in Redis. They need to be independent because they represent completely different datasets.
For example, consider the following independent kv sets:
Mapping between user_id and corresponding session_id
Mapping between game lobby id and corresponding host's user_id
Mapping between user_id and usernames
Now, if I had to store these key-value sets "in memory" using some data structure in Python, I would use 3 separate dicts (or HashMaps in Java). But how can I achieve this in Redis? The separation is required because for almost every user_id there will be a session_id as well as a username.
I'm referring to this Redis doc for the available data types: https://redis.io/topics/data-types, and the closest match to my requirements is the Hash. But as mentioned there, hashes are objects and are only space-efficient with a few fields. So is it okay to store huge numbers of kv pairs? Will that impact search time going forward?
Are there any alternatives to achieve this using Redis? I'm completely new to Redis.
You can do that by creating separate databases (sometimes Redis calls them 'key-spaces') for each of your data sets. Then you create your connection/pool objects with the db parameter:
>>> import redis
>>> pool = redis.ConnectionPool(host='redis_host', port=6379, db=0)
>>> r = redis.Redis(connection_pool=pool)
The only drawback of this is that you then need to manage a different pool for each data set. Another (perhaps easier) option is to just add a tag to the keys indicating which data set they belong to (sketched below).
Also worth noting that the key-spaces are identified numerically (you can't name them the way you would in other database systems), and only a small number are available: 16 by default, numbered 0 through 15 (configurable via the databases directive in redis.conf).
As for the number of K-V pairs, it's completely fine to store huge numbers of them, and lookups should still be pretty quick. O(1) lookup is the main advantage of key-based data structures, with the tradeoff being that you are limited (to some extent anyway) to using the key to access the data (as opposed to other databases where you can query on any field in the table).
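If you go the tagging route, here is a minimal sketch (host, names, and values are placeholders, not from the question) that keeps the three mappings apart with key prefixes:

import redis

r = redis.Redis(host='redis_host', port=6379, db=0)

user_id, lobby_id = '42', '7'  # placeholder ids

# One logical "dict" per prefix; plain GET/SET stays O(1).
r.set('session:' + user_id, 'abc123')      # user_id -> session_id
r.set('lobby_host:' + lobby_id, user_id)   # lobby id -> host's user_id
r.set('username:' + user_id, 'alice')      # user_id -> username

session_id = r.get('session:' + user_id)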
I used GAE and NDB for a project. I just noticed that if I create several objects and then retrieve the list of these objects, the order is not preserved (I use fetch() on the query).
This is a screenshot of the admin page, which shows the same problem:
I have several sessions there, which I created with names "day" 0 through 7, in that order. But as you can see, the order is not preserved.
I checked, and the keys are actually not incremental. Neither are the ids (the id should be incremental, shouldn't it? In any case, in some classes, though not this one, I used a hand-made key, so there is no id).
Is there a way to preserve insertion order?
(Or is it just strange behaviour? Or is it my mistake?)
PS: if you want to have a look at the code: this is the session model, which extends this class I made
Neither keys nor ids are strictly incremental (let alone incrementing by one) in ndb. You can set your own ids and ensure they increment properly.
Or you can add to your model(s) a DateTimeProperty:
created = ndb.DateTimeProperty(auto_now_add=True)
And in your view you can order the entities by the date of insertion, for example:
posts = Post.query().order(-Post.created).fetch()
which will order and fetch your (let's say) Post entities in descending order of insertion date.
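Putting the two snippets together, a minimal sketch (the Post model is just an example):

from google.appengine.ext import ndb

class Post(ndb.Model):
    title = ndb.StringProperty()
    # auto_now_add stamps the entity once, when it is first written
    created = ndb.DateTimeProperty(auto_now_add=True)

# Newest first; drop the minus sign for ascending insertion order.
posts = Post.query().order(-Post.created).fetch()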
It's not expected that the order will be preserved unless you perform a query that retrieves them in a particular order.
What makes you think they should be ordered?
Say I have RootEntity, AEntity(child of RootEntity), BEntity(child of AEntity).
class RootEntity(ndb.Model):
    rtp = ndb.StringProperty()

class AEntity(ndb.Model):
    ap = ndb.IntegerProperty()

class BEntity(ndb.Model):
    bp = ndb.StringProperty()
So in different handlers I need to get instances of BEntity with a specific ancestor (an instance of AEntity).
Here is my query:
BEntity.query(ancestor=ndb.Key(
    "RootEntity", 1,
    "AEntity", AEntity.query(ancestor=ndb.Key("RootEntity", 1))
                      .filter(AEntity.ap == int(some_value))
                      .get().key.integer_id()))
How can I optimize this query? Make it better, maybe less convoluted?
Update:
This query is part of a function with the @ndb.transactional decorator.
You should not use Entity Groups to represent entity relationships.
Entity groups have a special purpose: to define the scope of transactions. They give you the ability to update multiple entities transactionally, as long as they are part of the same entity group (this limitation has been somewhat relaxed with the newer XG transactions). They also allow you to use queries within transactions (not available with XG transactions).
The downside of entity groups is that they have an update limitation of 1 write/second.
In your case my suggestion would be to use separate entities and make references between them. The reference should be a Key of the referenced entity as this is type-safe.
Regarding query simplicity: GAE unfortunately does not support JOINs or reference (multi-entity) queries, so you would still need to combine multiple queries together (as you do now).
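For example, a rough sketch of the reference approach with ndb.KeyProperty (the a_ref property name and setup values are mine, not from the question):

from google.appengine.ext import ndb

class AEntity(ndb.Model):
    ap = ndb.IntegerProperty()

class BEntity(ndb.Model):
    bp = ndb.StringProperty()
    # Reference to the related AEntity instead of an ancestor path
    a_ref = ndb.KeyProperty(kind=AEntity)

some_value = '1'  # stands in for the handler input

# Still two queries, but no entity-group write contention:
a = AEntity.query(AEntity.ap == int(some_value)).get()
b_list = BEntity.query(BEntity.a_ref == a.key).fetch()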
There is a give and take with ancestor queries. They are more verbose and messier to deal with, but you get better structure for your data and consistency in your queries.
To simplify this, if your handler knows the BEntity you want to get, just pass around the key.urlsafe()-encoded key; it already has all of your ancestor information encoded.
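For instance (the kinds and IDs below are made up):

from google.appengine.ext import ndb

# Handler A: hand the full key to the client as an opaque token
b_key = ndb.Key("RootEntity", 1, "AEntity", 42, "BEntity", 7)
token = b_key.urlsafe()

# Handler B: rebuild the key and fetch directly, no queries at all
b = ndb.Key(urlsafe=token).get()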
If this is not possible, try possibly restructuring your data. Since these objects are all of the same ancestor, they belong to the same entity group, thus at most you can insert/update ~1 time per second for objects in that entity group. If you require higher throughput or do not require consistent ancestral queries, then try using ndb.KeyProperty to link entities with a reference to a parent rather than as an ancestor. Then you'd only need to get a single parent to query on rather than the parent and the parent's parent.
You should also try to use IDs whenever possible, so you can avoid having to filter for entities in your datastore by their properties and can instead reference them by ID:
BEntity.query(ancestor = ndb.Key("RootEntity", 1, "AEntity", int(some_value)))
Here, int(some_value) is the integer ID of the AEntity you used when you created that object. Just be sure that the IDs you manually create/use will be unique across all instances of that Model that share the same parent.
EDIT:
To clarify, my last example should have been clearer: I was suggesting restructuring the data so that int(some_value) is used as the integer ID of the AEntity rather than being stored as a separate property of the entity, if possible of course. In the example given, a query is performed for the AEntity objects that have a given integer field value of int(some_value) and executed with a get(), implying that you always expect a single result for that integer value. That makes it a good candidate for the integer ID in that object's key, eliminating the need for a query.
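In code, the restructuring would look roughly like this (assuming some_value is unique per parent; the setup values are invented):

from google.appengine.ext import ndb

class AEntity(ndb.Model):
    pass

class BEntity(ndb.Model):
    bp = ndb.StringProperty()

some_value = '5'  # stands in for the handler input

# Create the AEntity with the value as its integer ID, not a property...
AEntity(parent=ndb.Key("RootEntity", 1), id=int(some_value)).put()

# ...so later lookups need no AEntity query at all:
b_list = BEntity.query(
    ancestor=ndb.Key("RootEntity", 1, "AEntity", int(some_value))).fetch()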
I'm creating a Google App Engine application (python) and I'm learning about the general framework. I've been looking at the tutorial and documentation for the NDB datastore, and I'm having some difficulty wrapping my head around the concepts. I have a large background with SQL databases and I've never worked with any other type of data storage system, so I'm thinking that's where I'm running into trouble.
My current understanding is this: The NDB datastore is a collection of entities (analogous to DB records) that have properties (analogous to DB fields/columns). Entities are created using a Model (analogous to a DB schema). Every entity has a key that is generated for it when it is stored. This is where I run into trouble because these keys do not seem to have an analogy to anything in SQL DB concepts. They seem similar to primary keys for tables, but those are more tightly bound to records, and in fact are fields themselves. These NDB keys are not properties of entities, but are considered separate objects from entities. If an entity is stored in the datastore, you can retrieve that entity using its key.
One of my big questions is where do you get the keys for this? Some of the documentation I saw showed examples in which keys were simply created. I don't understand this. It seemed that when entities are stored, the put() method returns a key that can be used later. So how can you just create keys and define ids if the original keys are generated by the datastore?
Another thing that I seem to be struggling with is the concept of ancestry with keys. You can define parent keys of whatever kind you want. Is there a predefined schema for this? For example, if I had a model subclass called 'Person', and I created a key of kind 'Person', can I use that key as a parent of any other type? Like if I wanted a 'Shoe' key to be a child of a 'Person' key, could I also then declare a 'Car' key to be a child of that same 'Person' key? Or will I be unable to after adding the 'Shoe' key?
I'd really just like a simple explanation of the NDB datastore and its API for someone coming from a primarily SQL background.
I think you're overcomplicating things in your mind. When you create an entity, you can either give it a named key that you've chosen yourself, or leave that out and let the datastore choose a numeric ID. Either way, when you call put, the datastore will return the key, which is stored in the form [<entity_kind>, <id_or_name>] (actually this also includes the application ID and any namespace, but I'll leave that out for clarity).
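A quick sketch of the two options (the Person kind is just an example):

from google.appengine.ext import ndb

class Person(ndb.Model):
    name = ndb.StringProperty()

# Named key: you pick the identifier yourself
alice_key = Person(id='alice', name='Alice').put()
# -> Key('Person', 'alice')

# Auto ID: omit it and the datastore assigns a number on put()
bob_key = Person(name='Bob').put()
# -> Key('Person', <some generated number>)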
You can make entities members of an entity group by giving them an ancestor. That ancestor doesn't actually have to refer to an existing entity, although it usually does. All that happens with an ancestor is that the entity's key includes the key of the ancestor: so it now looks like [<parent_entity_kind>, <parent_id_or_name>, <entity_kind>, <id_or_name>]. You can now only get the entity by including its parent key. So, in your example, the Shoe entity could be a child of the Person, whether or not that Person has previously been created: it's the child that knows about the ancestor, not the other way round.
(Note that that ancestry path can be extended arbitrarily: the child entity can itself be an ancestor, and so on. In this case, the group is determined by the entity at the top of the tree.)
Saving entities as part of a group has advantages in terms of consistency: a query inside an entity group is always guaranteed to be fully consistent, whereas outside an entity group queries are only eventually consistent. However, there are also disadvantages, in that the write rate of an entity group is limited to 1 per second for the whole group.
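Here is roughly what that looks like in ndb (the kinds are from your example; note the Person never has to be put()):

from google.appengine.ext import ndb

class Shoe(ndb.Model):
    size = ndb.IntegerProperty()

# The ancestor only has to exist as a key, not as a stored entity.
person_key = ndb.Key('Person', 'alice')
shoe_key = Shoe(parent=person_key, size=42).put()
# -> Key('Person', 'alice', 'Shoe', <auto id>)

# Ancestor queries are strongly consistent within the group:
shoes = Shoe.query(ancestor=person_key).fetch()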
Datastore keys are a little more analogous to internal SQL row identifiers, but of course not entirely. Identifiers in AppEngine are a bit like SQL primary keys. To support decentralised concurrent creation of new keys by many application instances in a cloud of servers, AppEngine internally generates the keys to guarantee uniqueness. Your application defines parameters (application identifier, optional namespace, kind, and optional entity identifier) which AppEngine uses to seed its key generator. If you do not provide an identifier, AppEngine will generate a unique numeric identifier that you can read.
Eventual consistency takes time, so it is occasionally more efficient to request multiple new keys in bulk. AppEngine then reserves a range of numeric entity identifiers for you, and you can read their values from the returned keys.
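In ndb the bulk request is allocate_ids, if I recall the API correctly (the Thing kind is made up):

from google.appengine.ext import ndb

class Thing(ndb.Model):
    pass

# Reserve 100 IDs in one round trip; the datastore guarantees it
# will never auto-assign these IDs to anyone else.
first_key, last_key = Thing.allocate_ids(100)
ids = range(first_key.id(), last_key.id() + 1)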
Ancestry is used to group together writes of related entities of all kinds for the purpose of transactions and isolation. There is no predefined schema for this but you are limited to one parent per child.
In your example, one particular Shoe might have a particular Person as parent. Another particular Shoe could have a Horse as parent. And another Shoe might have no parent. Many entities of all kinds can have the same parent, so several Car entities could also have that initial Person as parent. The Datastore is schemaless, so it's up to your application to allow or forbid a Car to have a Horse as parent.
Note that a child knows its parent, but a parent does not know its children, because implementing that would impact scalability.
Let's say there is a table of People, and let's say there are 1000+ in the system. Each People item has the following fields: name, email, occupation, etc.
And we want to allow a People item to have a list of names (nicknames & such) where no other data is associated with the name - a name is just a string.
Is this exactly what PickleType is for? What kind of performance difference is there between using PickleType and creating a Name table so that the name field of People is a one-to-many relationship?
Yes, this is one good use case for SQLAlchemy's PickleType field, documented very well here. There are obvious performance advantages to using it.
Using your example, assume you have a People item that uses a one-to-many lookup. This requires the database to perform a JOIN to collect the sub-elements; in this case, the Person's nicknames, if any. However, you get the benefit of having native objects ready to use in your Python code, without the cost of deserializing pickles.
In comparison, the list of strings can be pickled and stored as a PickleType in the database, which is internally stored as a LargeBinary. Querying for a Person will only require the database to hit a single table, with no JOINs, resulting in an extremely fast return of data. However, you now incur the "cost" of de-pickling each item back into a Python object, which can be significant if you're not storing native datatypes (e.g. string, int, list, dict).
Additionally, by storing pickles in the database, you also lose the ability for the underlying database to filter results given a WHERE condition; especially with integers and datetime objects. A native database call can return values within a given numeric or date range, but will have no concept of what the string representing these items really is.
Lastly, a simple change to a single pickle could allow arbitrary code execution within your application. It's unlikely, but must be stated.
IMHO, storing pickles is a nice way to store certain types of data, but how well it works will vary greatly with the type of data. I can tell you we use it pretty extensively in our schema, even on several tables with over half a billion records, quite nicely.
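As a concrete sketch of the PickleType approach (table, column, and engine URL below are invented for illustration, not taken from the question):

from sqlalchemy import Column, Integer, String, PickleType, create_engine
from sqlalchemy.orm import declarative_base, Session
from sqlalchemy.ext.mutable import MutableList

Base = declarative_base()

class Person(Base):
    __tablename__ = 'people'
    id = Column(Integer, primary_key=True)
    name = Column(String)
    # A plain list of strings, pickled into a single binary column.
    # MutableList lets the ORM notice in-place appends to the list.
    nicknames = Column(MutableList.as_mutable(PickleType), default=list)

engine = create_engine('sqlite://')  # in-memory DB just for the demo
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Person(name='Bob', nicknames=['Bobby', 'Rob']))
    session.commit()

Without the MutableList wrapper, an in-place person.nicknames.append(...) would silently not be flushed, since SQLAlchemy only detects reassignment of the attribute.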