How to query if an entity exists in App Engine NDB - Python

I'm having some trouble wrapping my head around NDB. For some reason it's just not clicking. The thing I'm struggling with the most is the whole key/kind/ancestor structure.
I'm just trying to store a simple set of JSON data. When I store data, I want to check beforehand whether a duplicate entity exists (based on the key, not the data) so I don't store a duplicate entity.
class EarthquakeDB(ndb.Model):
    data = ndb.JsonProperty()
    datetime = ndb.DateTimeProperty(auto_now_add=True)
Then, to store data:
quake_entry = EarthquakeDB(parent=ndb.Key('Earthquakes', quake['id']), data=quake).put()
So my questions are:
How do I check whether that particular key exists before I insert more data?
How would I go about pulling that data back out to read, based on the key?

After some trial and error, and with the assistance of voscausa, here is what I came up with to solve the problem. The data is being read in via a for loop.
for quake in data:
    quake_entity = EarthquakeDB.get_by_id(quake['id'])
    if quake_entity:
        continue
    else:
        quake_entity = EarthquakeDB(id=quake['id'], data=quake).put()

Because you do not provide a full NDB key (only a parent), you will always insert a unique key.
But why do you use your own entity id for the parent?
I think you mean:
quake_entry = EarthquakeDB(id=quake['id'], data=quake)
quake_entry.put()
To get it, you can use:
quake_entry = ndb.Key('EarthquakeDB', quake['id']).get()
Here you can find two excellent videos about the datastore, strong consistency and entity groups. Datastore Introduction and Datastore Query, Index and Transaction.
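As an aside, NDB's get_or_insert can fold the existence check and the insert into a single transactional call; a minimal sketch, assuming the same EarthquakeDB model and feed loop as above:
# Sketch only: get_or_insert transactionally returns the existing entity for
# quake['id'] or creates it if missing, so check-then-insert races are avoided.
for quake in data:
    EarthquakeDB.get_or_insert(quake['id'], data=quake)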

Related

Azure Table Storage sync between 2 different storages

I have a list of storage accounts and I would like to copy the exact table content from source_table to destination_table exactly as it is. That means if I add an entry to source_table it should be copied to destination_table, and likewise if I delete an entry from the source I want it deleted from the destination.
So far I have in place this code:
table_service_out = TableService(account_name="sourcestorageaccount",
                                 account_key="source key")
table_service_in = TableService(account_name="destination storage",
                                account_key="destinationKey")
query_size = 1000

# Save data to storage2 and check whether there is data left in the current table; if so, recurse.
def queryAndSaveAllDataBySize(source_table_name, target_table_name, resp_data: ListGenerator,
                              table_out: TableService, table_in: TableService, query_size: int):
    for item in resp_data:
        tb_name = source_table_name
        del item.etag
        del item.Timestamp
        print("INSERT data:" + str(item) + " into TABLE:" + tb_name)
        table_in.insert_or_replace_entity(target_table_name, item)
    if resp_data.next_marker:
        data = table_out.query_entities(table_name=source_table_name, num_results=query_size,
                                        marker=resp_data.next_marker)
        queryAndSaveAllDataBySize(source_table_name, target_table_name, data, table_out, table_in, query_size)

tbs_out = table_service_out.list_tables()
print(tbs_out)
for tb in tbs_out:
    table = tb.name
    # Create a table with the same name in storage2
    table_service_in.create_table(table_name=table, fail_on_exist=False)
    # First query
    data = table_service_out.query_entities(tb.name, num_results=query_size)
    queryAndSaveAllDataBySize(tb.name, table, data, table_service_out, table_service_in, query_size)
As you can see, this block of code runs just fine: it loops over the source storage account's tables and recreates each table and its content in the destination storage account. But I am missing the part where I check whether a record has been deleted from the source storage and remove the same record from the destination table.
I hope my question/issue is clear enough; if not, please just ask me for more information.
Thank you so much for any help you can provide.
UPDATE:
The more I think about this, the more the logic gets messy.
One solution that I thought about and tried is to keep two lists that store every single table entity:
Source_table_entries
Destination_table_entries
Once I have populated the lists, on each run I can compare the partition keys, and if a key is present in Destination_table_entries but not in the source, it gets marked for deletion (roughly as sketched below).
But this logic only works flawlessly as long as I have a small table; unfortunately some tables contain hundreds of thousands of entities (and I have hundreds of storage accounts), which sooner or later will become a mess to manage.
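A rough sketch of that key-comparison idea, using only PartitionKey/RowKey pairs to keep memory use down. The helper name is illustrative; it assumes the table_service_out / table_service_in clients defined above and that iterating the query generator walks through all pages:
def delete_removed_entities(source_table, destination_table, table_out, table_in):
    # Collect only the keys of every entity still present in the source table.
    source_keys = set()
    for entity in table_out.query_entities(source_table, select='PartitionKey,RowKey'):
        source_keys.add((entity.PartitionKey, entity.RowKey))
    # Anything present in the destination but missing from the source gets deleted.
    for entity in table_in.query_entities(destination_table, select='PartitionKey,RowKey'):
        if (entity.PartitionKey, entity.RowKey) not in source_keys:
            table_in.delete_entity(destination_table, entity.PartitionKey, entity.RowKey)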
So another solution that I thought about is to keep the same code I have above and just create a new table every week and delete the older one (from the destination storage). For example:
Table week 1
Table week 2
Table week 3 (this will be deleted)
I read around that I could potentially add metadata to the table for the date and leverage that to decide which table should be deleted based on the date and time, but I cannot find anything in the documentation.
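Azure tables do not appear to expose user-defined metadata, but the week can be encoded in the table name instead. A rough, illustrative sketch (the naming helper and the "backup" base name are not from the original post):
import datetime

# Illustrative only: put the ISO year/week in the table name, then drop the
# destination table that is two weeks old.
def weekly_table_name(base_name, when=None):
    when = when or datetime.date.today()
    year, week, _ = when.isocalendar()
    return "{0}W{1}{2:02d}".format(base_name, year, week)  # e.g. backupW202305

current = weekly_table_name("backup")
stale = weekly_table_name("backup", datetime.date.today() - datetime.timedelta(weeks=2))

table_service_in.create_table(table_name=current, fail_on_exist=False)
table_service_in.delete_table(table_name=stale, fail_not_exist=False)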
Can anyone please point me towards the best approach for this? Thank you so much, I am losing my mind over this last bit.

Unable to update document without specifying Primary Key

schema.py:
class Test(Document):
    _id = StringField()
    classID = StringField(required=True, unique=True)
    status = StringField()
====================
database.py:
query = schema.Test(_id=id)
query.update(status="confirm")
Critical error occured. attempt to update a document not yet saved
I can update the DB only if I specify _id = StringField(primary_key=True), but then when I insert new data the _id has to be supplied by me instead of being created automatically by MongoDB.
Anyone can help me with a solution?
Thanks!
Inserts and updates are distinct operations in MongoDB:
Insert adds a document to the collection
Update finds a document in the collection given a search criteria, then changes this document
If you haven't inserted a document, trying to update it won't do anything since it will never be found by any search criteria. Your ODM is pointing this out to you and prevents you from updating a document you haven't saved. Using the driver you can issue the update anyway but it won't have any effect.
If you want to add a new document to the database, use inserts. To change documents that are already saved, use updates. To change fields on document instances without saving them, consult your ODM documentation to figure out how to do that instead of attempting to save the documents.
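For example, with the schema above (and the explicit _id field left out so MongoDB can generate it), the usual flow is roughly the following sketch; using classID as the lookup field is an assumption:
# Sketch assuming the Test document above, with _id left to MongoDB.
doc = schema.Test(classID=class_id, status="pending")
doc.save()                # insert: MongoDB generates the ObjectId

# Later: update the already-saved document, found by classID.
schema.Test.objects(classID=class_id).update_one(set__status="confirm")

# Or load it, change the field in memory, and save it again.
doc = schema.Test.objects.get(classID=class_id)
doc.status = "confirm"
doc.save()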

How to efficiently fetch objects after they are created using Django ORM's bulk_create?

I have to insert multiple objects into a table. There are two ways to do that:
1) Insert each one using save(). But in this case there will be n SQL DB queries for n objects.
2) Insert all of them together using bulk_create(). In this case there will be one SQL DB query for n objects.
Clearly, the second option is better and hence I am using it. Now the problem with bulk_create is that it does not return the ids of the inserted objects, so they cannot be used further to create objects of other models which have a foreign key to the created objects.
To overcome this, we need to fetch the objects created by bulk_create.
Now the question is "assuming as in my situation, there is no way to uniquely identify the created objects, how do we fetch them?"
Currently I am maintaining a timestamp to fetch them, something like below:
my_objects = []

# Timestamp to be used for fetching created objects
time_stamp = datetime.datetime.now()

# Creating a list of instantiated objects
for obj_data in obj_data_list:
    my_objects.append(MyModel(**obj_data))

# Bulk inserting the instantiated objects into the DB
MyModel.objects.bulk_create(my_objects)

# Using the timestamp to fetch the created objects
MyModel.objects.filter(created_at__gte=time_stamp)
Now this works fine, but it will fail in one case.
If, at the time of bulk-creating these objects, some more objects are created from somewhere else, then those objects will also be fetched by my query, which is not desired.
Can someone come up with a better solution?
As bulk_create will not retrieve the primary keys for you, you'll have to supply the keys yourself.
This process is simple if you are not using the default generated primary key, which is an AutoField.
If you are sticking with the default, you'll need to wrap your code in an atomic transaction and supply the primary keys yourself. This way you'll know which records were inserted.
from django.db import transaction

inserted_ids = []
with transaction.atomic():
    my_objects = []
    max_id = int(MyModel.objects.latest('pk').pk)
    id_count = max_id
    for obj_data in obj_data_list:
        id_count += 1
        obj_data['id'] = id_count
        inserted_ids.append(obj_data['id'])
        my_objects.append(MyModel(**obj_data))
    MyModel.objects.bulk_create(my_objects)
# Equivalently, the new ids form a contiguous range above the previous maximum.
inserted_ids = range(max_id + 1, id_count + 1)
As you already know.
If the model’s primary key is an AutoField it does not retrieve and
set the primary key attribute, as save() does.
The way you're trying to do it is usually the way people do it.
The other solution below is, in some cases, better:
# Evaluate the existing ids before bulk_create; the queryset is lazy, so without
# list() it would run afterwards and include the new rows too.
my_ids = list(MyModel.objects.values_list('id', flat=True))
objs = MyModel.objects.bulk_create(my_objects)
new_objs = MyModel.objects.exclude(id__in=my_ids).values_list('id', flat=True)
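As a side note, on a PostgreSQL backend recent Django versions have bulk_create fill in the primary keys of the objects it returns, so no extra query is needed there; a minimal sketch under that assumption:
# Assumes PostgreSQL and a Django version where bulk_create sets pk on the
# returned objects (the database reports the new ids back to Django).
created = MyModel.objects.bulk_create(my_objects)
created_ids = [obj.pk for obj in created]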

GAE python ndb - How to get_by_id with projection?

I'd like to do this.
Content.get_by_id(content_id, projection=['title'])
However, I got an error:
TypeError: Unknown configuration option ('projection')
Should I do it like this instead? How?
Content.query(key=Key('Content', content_id)).get(projection=['title'])
Why bother with projection when getting a single entity? Because Content.body could be large, so I want to reduce datastore read time and instance hours.
If you are using NDB, the query below should work:
Content.query(Content.key == ndb.Key('Content', content_id)).get(projection=[Content.title])
Note: it gets this data from the query index, so make sure the property is indexed. Reference: https://developers.google.com/appengine/docs/python/ndb/queries#projection
I figured out the following code.
Content.query(Content.key == ndb.Key('Content', content_id)).get(projection=['etag'])
I found a hint from https://developers.google.com/appengine/docs/python/ndb/properties
Don't name a property "key." This name is reserved for a special
property used to store the Model key. Though it may work locally, a
property named "key" will prevent deployment to App Engine.
There is a simpler method than the currently posted answers.
As previous answers have mentioned, projections are only for ndb queries.
Previous answers suggest replacing get_by_id with a projection query keyed on the entity's id, in the form of:
<Model>.query(<Model>.key == ndb.Key('<Model>', model_id)).get(projection=['property_1', 'property_2', ...])
However, you can just manipulate the model's _properties directly. (See: https://cloud.google.com/appengine/docs/standard/python/ndb/modelclass#intro_properties)
For example:
desired_properties = ['title', 'tags']
content = Content.get_by_id(content_id)
content._properties = {k: v for k, v in content._properties.iteritems()
                       if k in desired_properties}
print content
This would update the entity properties and only return those properties whose keys are in the desired_properties list.
Not sure if this is the intended functionality behind _properties, but it works, and it also avoids the need to generate and maintain additional indexes for the projection queries.
The only downside is that this retrieves the entire entity into memory first. If the entity has arbitrarily large metadata properties that will affect performance, it would be a better idea to use the projection query instead.
Projection is only for queries, not for get by id. Alternatively, you can put content.body in a different model and store only its ndb.Key in Content.
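A rough sketch of that split; ContentBody and body_key are illustrative names, not from the question:
# Keep the large body in its own entity and reference it by key.
class ContentBody(ndb.Model):
    body = ndb.TextProperty()

class Content(ndb.Model):
    title = ndb.StringProperty()
    body_key = ndb.KeyProperty(kind='ContentBody')

content = Content.get_by_id(content_id)    # small, cheap read: no body data
body_text = content.body_key.get().body    # second read, only when the body is needed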

Best method to determine which of a set of keys exist in the datastore

I have a few hundred keys, all of the same Model, which I have pre-computed:
candidate_keys = [db.Key(...), db.Key(...), db.Key(...), ...]
Some of these keys refer to actual entities in the datastore, and some do not. I wish to determine which keys do correspond to entities.
It is not necessary to know the data within the entities, just whether they exist.
One solution would be to use db.get():
keys_with_entities = set()
for entity in db.get(candidate_keys):
    if entity:
        keys_with_entities.add(entity.key())
However this procedure would fetch all entity data from the store which is unnecessary and costly.
A second idea is to use a Query with an IN filter on key_name, manually fetching in chunks of 30 to fit the requirements of the IN pseudo-filter. However keys-only queries are not allowed with the IN filter.
Is there a better way?
IN filters are not supported directly by the App Engine datastore; they're a convenience that's implemented in the client library. An IN query with 30 values is translated into 30 equality queries on one value each, resulting in 30 regular queries!
Due to round-trip times and the expense of even keys-only queries, I suspect you'll find that simply attempting to fetch all the entities in one batch fetch is the most efficient. If your entities are large, however, you can make a further optimization: For every entity you insert, insert an empty 'presence' entity as a child of that entity, and use that in queries. For example:
foo = AnEntity(...)
foo.put()
presence = PresenceEntity(key_name='x', parent=foo)
presence.put()
...

def exists(keys):
    test_keys = [db.Key.from_path('PresenceEntity', 'x', parent=x) for x in keys]
    return [x is not None for x in db.get(test_keys)]
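For instance, assuming a presence child was written for every entity, the helper above could be combined with the question's candidate keys roughly like this (hypothetical usage):
# Hypothetical usage of exists(): keep only the candidate keys whose presence
# child entity came back from the batch get.
keys_with_entities = set(key for key, present in zip(candidate_keys, exists(candidate_keys))
                         if present)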
At this point, the only solution I have is to manually query by key with keys_only=True, once per key.
for key in candidate_keys:
    if MyModel.all(keys_only=True).filter('__key__ =', key).count():
        keys_with_entities.add(key)
This may in fact be slower than just loading the entities in a batch and discarding them, although the batch load also hammers the Data Received from API quota.
How not to do it (update based on Nick Johnson's answer):
I was also considering adding a property specifically for the purpose of being able to scan for it with an IN filter.
class MyModel(db.Model):
    """Some model"""
    # ... all the old stuff
    the_key = db.StringProperty(required=True)  # just a duplicate of the key_name

# ... meanwhile, back in the example
for key_batch in batches_of_30(candidate_keys):
    key_names = [x.name() for x in key_batch]
    found_keys = MyModel.all(keys_only=True).filter('the_key IN', key_names)
    keys_with_entities.update(found_keys)
The reason this should be avoided is that the IN filter on a property sequentially performs an index scan, plus lookup once per item in your IN set. Each lookup takes 160-200ms so that very quickly becomes a very slow operation.
