From my understanding, BigTable is a column-oriented NoSQL database. Although Google Cloud Datastore is built on top of Google's BigTable infrastructure, I have yet to see documentation that expressly says Datastore itself is a column-oriented database. The fact that names reserved by the Python API are enforced in the API, but not in the Datastore itself, makes me question the extent to which Datastore mirrors the internal workings of BigTable. For example, the validation features in the ndb.Model class are enforced in application code but not in the datastore. An entity saved using the ndb.Model class can be retrieved somewhere else in the app that doesn't use the Model class, be modified, have properties added, and then be saved to the datastore without raising an error until it is loaded into a new instance of the Model class. With that said, is it safe to say Google Cloud Datastore is a column-oriented NoSQL database? If not, then what is it?
Strictly speaking, Google Cloud Datastore is a distributed, multi-dimensional, sorted map. As you mentioned, it is based on Google BigTable; however, BigTable is only the foundation.
From a high-level point of view, Datastore actually consists of three layers.
BigTable
This is the necessary base for Datastore. It maps a row key, a column key, and a timestamp (a three-dimensional mapping) to an array of bytes, and stores data in lexicographic order by row key (see the sketch after the list below).
High scalability and availability
Strong consistency for single row
Eventual consistency for multi-row level
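Conceptually (a rough sketch, not the actual implementation), the mapping can be pictured as a Python dict keyed by all three dimensions:

```python
# Rough conceptual sketch only: BigTable as a sorted, three-dimensional map.
# (row_key, column_key, timestamp) -> bytes
table = {
    ('com.example/user1', 'profile:name', 1500000000): b'Alice',
    ('com.example/user1', 'profile:name', 1400000000): b'Alcie',  # older version kept under an earlier timestamp
    ('com.example/user2', 'profile:name', 1500000000): b'Bob',
}

# Rows are kept in lexicographic order by row key, so a range scan over
# contiguous row keys is cheap:
for key in sorted(table):
    print(key, table[key])
```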
Megastore
This layer adds transactions on top of BigTable.
Datastore
A layer above Megastore. It enables running queries as index scans on BigTable. Here, an index is not used to improve performance; it is required for queries to return results at all.
Furthermore, it optionally adds strong consistency at the multi-row level via ancestor queries. Such queries force the respective indexes to update before the actual scan is executed.
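To illustrate with ndb (Article is a hypothetical model): a property that is excluded from the indexes simply cannot be queried, because every query is an index scan.

```python
from google.appengine.ext import ndb

class Article(ndb.Model):                      # hypothetical example model
    title = ndb.StringProperty()               # indexed by default
    body = ndb.StringProperty(indexed=False)   # stored, but no index rows written

Article(title='hello', body='world').put()

# Fine: the query scans the index on 'title'.
Article.query(Article.title == 'hello').fetch()

# Raises BadFilterError: with no index on 'body', a scan over it
# could never return results.
Article.query(Article.body == 'world').fetch()
```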
Related
I'm building a web app to manage a fleet of moving vehicles. Each vehicle has a set of fixed data (like its license plate) and another set of data that gets updated 3 times a second via websocket (its state and GNSS coordinates). This is real-time data.
In the frontend, I need a view with a grid showing all the vehicles. The grid must display both the fixed data and the latest known values of the real-time data for each one. The user must be able to sort or filter by any column.
My current stack is Django + Django Rest Framework + Django Channels + PostgreSQL for the backend, and react-admin (React+Redux+FinalForm) for the frontend.
The problem:
Storing the real-time data in the database would easily fulfill all my requirements, as I would benefit from DRF's and the Django ORM's built-in sorting/filtering, but I believe that hitting the DB 3 times/sec for each vehicle could easily become a bottleneck when scaling up.
My current solution keeps the RT data in Python objects that I serialize/deserialize (pickle) into/from the Django cache (Redis-backed) instead of storing them as models in the database. However, I have to retrieve the data manually and DRF-serialize it in my DRF views, so sorting and filtering don't work, as there are no SQL queries involved.
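A stripped-down sketch of this approach (the function names are just illustrative):

```python
# Illustrative sketch of the cache-based approach described above.
from django.core.cache import cache  # Redis-backed in this setup

def store_state(vehicle_id, state):
    # Django pickles the value before handing it to the Redis backend.
    cache.set('vehicle:%s' % vehicle_id, state, timeout=None)

def load_states(vehicle_ids):
    # Bulk read of the latest known state for a page of vehicles.
    return cache.get_many(['vehicle:%s' % vid for vid in vehicle_ids])
```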
If PostgreSQL had some sort of memory-backed tables with zero disk access that would be great, but according to my research no such feature exists as of today.
So my question is:
What would be the correct/usual approach to my problem?
I think you should split your service code into smaller services:
Receive data from vehicles
Process data
Separate the vehicles into smaller groups and use a unique socket for each group
Try to update the latest values as a batch (see the sketch below)
Use RAID for your database.
Also, I think that using a cache for real-time data is a waste of server resources.
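A minimal sketch of the batching idea, assuming a hypothetical Vehicle model with state/lat/lon fields and Django 2.2+ for bulk_update:

```python
# Hypothetical sketch: buffer the latest state per vehicle in memory and
# flush once per second, instead of writing 3 times/sec per vehicle.
import threading

from myapp.models import Vehicle  # hypothetical model

latest = {}  # vehicle_id -> (state, lat, lon)
lock = threading.Lock()

def on_message(vehicle_id, state, lat, lon):
    # Called 3x/sec per vehicle by the websocket consumer; no DB access here.
    with lock:
        latest[vehicle_id] = (state, lat, lon)

def flush():
    # Run once per second, e.g. from a timer thread or a Celery beat task.
    with lock:
        snapshot = dict(latest)
        latest.clear()
    vehicles = list(Vehicle.objects.filter(id__in=snapshot.keys()))
    for v in vehicles:
        v.state, v.lat, v.lon = snapshot[v.id]
    # One SQL UPDATE batch instead of one write per message.
    Vehicle.objects.bulk_update(vehicles, ['state', 'lat', 'lon'])
```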
I'm developing a web application in Python with Django for scheduling medical services. The database used is MySQL.
The system has several complex searches. For good performance, I'm using Elasticsearch. The idea is to index only searchable data. The document created in ES is not exactly the same as the modeling in the database; for example, the Patient entity has relationships in the DB that are mapped as attributes in ES.
I am unsure about when to update the record in ES when some relationship is updated.
I currently update the record in ES whenever an object is created, changed, or deleted in the DB. In the case of relationships, I need to fetch the parent entity to update its record.
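Roughly, what I do today looks like this (the model names are illustrative; the elasticsearch-py client is assumed):

```python
# Illustrative sketch of the per-save strategy (Appointment/Patient are
# hypothetical models; elasticsearch-py client assumed).
from django.db.models.signals import post_save, post_delete
from django.dispatch import receiver
from elasticsearch import Elasticsearch

from myapp.models import Appointment  # hypothetical related model

es = Elasticsearch()

@receiver([post_save, post_delete], sender=Appointment)
def reindex_patient(sender, instance, **kwargs):
    patient = instance.patient  # fetch the parent entity
    es.index(
        index='patients',
        id=patient.pk,
        body={
            'name': patient.name,
            # DB relationship flattened into a plain attribute in ES:
            'appointment_dates': [a.date.isoformat()
                                  for a in patient.appointments.all()],
        },
    )
```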
Another strategy would be to rebuild the full index periodically. The problem with this strategy is the delay between a change and its availability for search.
What would be the best strategy in this case?
I am using App Engine with Python (version 2.7) for a web application which deals with job listings and job search.
The backend consists of a "Job" table with 20+ fields such as title, date, experience, etc. I have the necessary composite indexes defined for every permutation and combination of the filters. As you might have guessed, the number of indexes is high.
The front-end provides option for users to search for jobs and filter them using the columns.
This works as expected, but with the following drawbacks:
Slow Search Performance
The search is divided into two parts: built-in datastore filtering, and then custom filtering on top of the refined results. The custom filtering is required to apply the complex filters that App Engine does not support.
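Schematically, the two stages look something like this (the Job fields here are illustrative):

```python
# Illustrative sketch of the two-stage filtering described above.
from google.appengine.ext import ndb

class Job(ndb.Model):  # simplified stand-in for the real model
    title = ndb.StringProperty()
    experience = ndb.IntegerProperty()

def search(min_experience, title_words):
    # Stage 1: datastore filter, backed by a composite index.
    candidates = Job.query(Job.experience >= min_experience).fetch(500)
    # Stage 2: custom Python filter for what datastore cannot express
    # (e.g. substring match on a user-defined title).
    return [j for j in candidates
            if all(w in j.title.lower() for w in title_words)]
```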
Exploding composite indexes
Some columns (5, for instance) accept only a fixed set of values, so filtering on them is pretty straightforward, while other fields can have user-defined values, and filtering on them requires custom Python code.
Jinja is the templating engine, which then renders the data into the HTML.
Advanced Search + Index References: https://cloud.google.com/appengine/articles/indexselection
Is there a better approach/algorithm for implementing the search and advanced search in the appengine?
You might want to consider using the Full Text Search API available in App Engine. In essence, when entities are created in Cloud Datastore, you would create a Document with the entity ID/Key and all searchable fields, and send it to the Search API for indexing. Any update to a Datastore entity would also need to update the corresponding Search document, and when entities are deleted, you delete the corresponding Search document.
Modify your application's search code to perform the search on the indexed documents instead of using Datastore queries: retrieve a page (e.g. 50) of document IDs, fetch the data for those 50 entities with a Datastore get, and display the results.
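A sketch of that flow (the index name and the Job fields are assumptions, not from your code):

```python
# Sketch only: mirror Job entities into a Search API index and query it.
from google.appengine.api import search
from google.appengine.ext import ndb

INDEX = search.Index(name='jobs')  # assumed index name

def index_job(job):
    # Keep the Search document in sync with the Datastore entity,
    # keyed by the entity's numeric ID.
    INDEX.put(search.Document(
        doc_id=str(job.key.id()),
        fields=[
            search.TextField(name='title', value=job.title),
            search.DateField(name='date', value=job.date),
            search.NumberField(name='experience', value=job.experience),
        ]))

def search_jobs(query_string):
    # Full-text search returns document IDs; fetch the entities by key.
    results = INDEX.search(search.Query(
        query_string,
        options=search.QueryOptions(limit=50, ids_only=True)))
    keys = [ndb.Key('Job', int(doc.doc_id)) for doc in results]
    return ndb.get_multi(keys)
```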
Per the documentation:
The Search API lets your application perform Google-like full-text searches over structured data, and supports Geolocation-based queries. It can be useful in any application domain that benefits from full-text search, such as:
This would definitely give your application users a better search experience compared with Datastore queries.
Once you implement this, you might be able to just get rid of the composite indexes from Datastore.
I am new to GAE. I started working with the NDB datastore service, but its parent key structure is really confusing me. I have watched some tutorials on YouTube, but they just explain the documentation.
I have also followed the documentation, but it is still not clear to me. This is the link I explored:
Google App Engine NDB Data Store Service
The NDB datastore is a distributed system. Absolute data consistency is very hard for distributed systems in general, and NDB is eventually consistent by default. This means that, by default:
If you add a record it may not appear immediately in a query
You cannot do transactions across multiple records by default
If you have more strict requirements you can define groups of entities by giving them the same parent key and specifying it in queries. You are then able to get consistent behaviour within these groups.
It is often better not to use parent keys at all, since they come with a heavy write-throughput penalty. Most of the time, apps do not need them.
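A small sketch of the trade-off (Message and Room are example kinds):

```python
from google.appengine.ext import ndb

class Message(ndb.Model):  # example kind
    text = ndb.StringProperty()

# Entities sharing the same parent key form one entity group.
room_key = ndb.Key('Room', 'lobby')  # the Room entity need not exist
Message(parent=room_key, text='hi').put()

# Ancestor query: strongly consistent; the write above is always visible.
messages = Message.query(ancestor=room_key).fetch()

# Global query: eventually consistent; may briefly miss the new message.
maybe_stale = Message.query().fetch()
```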
Quote from Entities, Properties, and Keys
There is a write throughput limit of about one transaction per second within a single entity group. This limitation exists because Datastore performs masterless, synchronous replication of each entity group over a wide geographic area to provide high reliability and fault tolerance.
I'm creating a Google App Engine application (python) and I'm learning about the general framework. I've been looking at the tutorial and documentation for the NDB datastore, and I'm having some difficulty wrapping my head around the concepts. I have a large background with SQL databases and I've never worked with any other type of data storage system, so I'm thinking that's where I'm running into trouble.
My current understanding is this: The NDB datastore is a collection of entities (analogous to DB records) that have properties (analogous to DB fields/columns). Entities are created using a Model (analogous to a DB schema). Every entity has a key that is generated for it when it is stored. This is where I run into trouble because these keys do not seem to have an analogy to anything in SQL DB concepts. They seem similar to primary keys for tables, but those are more tightly bound to records, and in fact are fields themselves. These NDB keys are not properties of entities, but are considered separate objects from entities. If an entity is stored in the datastore, you can retrieve that entity using its key.
One of my big questions is where do you get the keys for this? Some of the documentation I saw showed examples in which keys were simply created. I don't understand this. It seemed that when entities are stored, the put() method returns a key that can be used later. So how can you just create keys and define ids if the original keys are generated by the datastore?
Another thing that I seem to be struggling with is the concept of ancestry with keys. You can define parent keys of whatever kind you want. Is there a predefined schema for this? For example, if I had a model subclass called 'Person', and I created a key of kind 'Person', can I use that key as a parent of any other type? Like if I wanted a 'Shoe' key to be a child of a 'Person' key, could I also then declare a 'Car' key to be a child of that same 'Person' key? Or will I be unable to after adding the 'Shoe' key?
I'd really just like a simple explanation of the NDB datastore and its API for someone coming from a primarily SQL background.
I think you're overcomplicating things in your mind. When you create an entity, you can either give it a named key that you've chosen yourself, or leave that out and let the datastore choose a numeric ID. Either way, when you call put, the datastore will return the key, which is stored in the form [<entity_kind>, <id_or_name>] (actually this also includes the application ID and any namespace, but I'll leave those out for clarity).
You can make entities members of an entity group by giving them an ancestor. That ancestor doesn't actually have to refer to an existing entity, although it usually does. All that happens with an ancestor is that the entity's key includes the key of the ancestor: so it now looks like [<parent_entity_kind>, <parent_id_or_name>, <entity_kind>, <id_or_name>]. You can now only get the entity by including its parent key. So, in your example, the Shoe entity could be a child of the Person, whether or not that Person has previously been created: it's the child that knows about the ancestor, not the other way round.
(Note that that ancestry path can be extended arbitrarily: the child entity can itself be an ancestor, and so on. In this case, the group is determined by the entity at the top of the tree.)
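To make that concrete, here is a sketch using the kinds from the question:

```python
from google.appengine.ext import ndb

class Person(ndb.Model):
    name = ndb.StringProperty()

class Shoe(ndb.Model):
    size = ndb.IntegerProperty()

class Car(ndb.Model):
    plate = ndb.StringProperty()

# A named key chosen by the application; no Person entity need exist yet.
person_key = ndb.Key('Person', 'alice')

# The child's key embeds the ancestor path.
shoe_key = Shoe(parent=person_key, size=38).put()
# shoe_key now looks like: Key('Person', 'alice', 'Shoe', <numeric_id>)

# A Car child of the same Person is equally fine; kinds are independent,
# and adding the Shoe does not restrict what else can share the parent.
car_key = Car(parent=person_key, plate='ABC-123').put()
```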
Saving entities as part of a group has advantages in terms of consistency: a query inside an entity group is always guaranteed to be fully consistent, whereas outside a group it is only eventually consistent. However, there are also disadvantages: the write rate of an entity group is limited to about 1 per second for the whole group.
Datastore keys are a little more analogous to internal SQL row identifiers, but of course not entirely; identifiers in App Engine are a bit like SQL primary keys. To support decentralised, concurrent creation of new keys by many application instances in a cloud of servers, App Engine generates the keys internally to guarantee uniqueness. Your application defines the parameters (application identifier, optional namespace, kind, and optional entity identifier) that App Engine uses to seed its key generator. If you do not provide an identifier, App Engine will generate a unique numeric identifier that you can read.
Eventual consistency takes time, so it is occasionally more efficient to request multiple new keys in bulk. App Engine then generates a range of numeric entity identifiers for you, and you can read their values from the returned keys.
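In ndb this is done with Model.allocate_ids, which reserves a range of numeric IDs without writing anything; a sketch:

```python
from google.appengine.ext import ndb

class Person(ndb.Model):
    name = ndb.StringProperty()

# Reserve 10 numeric IDs up front; the datastore guarantees it will not
# hand these out to anyone else.
first_key, last_key = Person.allocate_ids(10)
ids = range(first_key.id(), last_key.id() + 1)

# Build entities with pre-assigned keys and write them in one batch.
people = [Person(key=ndb.Key('Person', i), name='person %d' % i)
          for i in ids]
ndb.put_multi(people)
```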
Ancestry is used to group together writes of related entities of all kinds for the purpose of transactions and isolation. There is no predefined schema for this but you are limited to one parent per child.
In your example, one particular Shoe might have a particular Person as parent. Another particular Shoe could have a Horse as parent. And another Shoe might have no parent. Many entities of all kinds can have the same parent, so several Car entities could also have that initial Person as parent. The Datastore is schemaless, so it's up to your application to allow or forbid a Car to have a Horse as parent.
Note that a child knows its parent, but a parent does not know its children, because implementing that would impact scalability.