Delete datastore entries synchronously in Google App Engine - Python

I use Python on GAE and am trying to delete an entry in the datastore using db.delete(model_obj). I assumed this operation runs synchronously, since the documentation distinguishes delete() from delete_async(), but when I read the source code in db, the delete method simply calls delete_async, which does not match what the documentation says :(
So is there any way to do the delete synchronously?
Here is the source code in db:
def delete_async(models, **kwargs):
    """Asynchronous version of delete one or more Model instances.

    Identical to db.delete() except returns an asynchronous object. Call
    get_result() on the return value to block on the call.
    """
    if isinstance(models, (basestring, Model, Key)):
        models = [models]
    else:
        try:
            models = iter(models)
        except TypeError:
            models = [models]
    keys = [_coerce_to_key(v) for v in models]
    return datastore.DeleteAsync(keys, **kwargs)

def delete(models, **kwargs):
    """Delete one or more Model instances.
    """
    delete_async(models, **kwargs).get_result()
EDIT: From a comment, this is the original misbehaving code:
def tearDown(self):
    print self.account
    db.delete(self.device)
    db.delete(self.account)
    print Account.get_by_email(self.email, case_sensitive=False)
The output of the two print statements is <Account object at 0x10d1827d0> and <Account object at 0x10d1825d0>. The memory addresses are different, but they point to the same entity. If I add some latency after the delete, such as a for loop, the fetched object is None.

The code you show for delete calls delete_async, yes, but then it calls get_result on the returned asynchronous handle, which will block until the delete actually occurs. So, delete is synchronous.
The reason the sample code you show is returning an object is that you're probably running a query to fetch the account; I presume the email is not the db.Key of the account? Normal queries are not guaranteed to return updated results immediately. To avoid seeing stale data, you either need to use an ancestor query or look up the entity by key, both of which are strongly consistent.
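For example (a sketch against the old db API, reusing the Account model and email from the code above; the exact attributes are assumptions):
from google.appengine.ext import db

db.delete(account)                       # blocks until the delete is applied

# Strongly consistent: look the entity up by key.
assert db.get(account.key()) is None

# Eventually consistent: a query on a non-key property may still return
# the just-deleted entity for a short while.
stale = Account.all().filter('email =', email).get()  # may not be None yet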

Related

Understanding session with FastAPI dependency

I am new to Python and was studying FastAPI and SQLModel.
Reference link: https://sqlmodel.tiangolo.com/tutorial/fastapi/session-with-dependency/#the-with-block
Here, they have something like this
def create_hero(*, session: Session = Depends(get_session), hero: HeroCreate):
    db_hero = Hero.from_orm(hero)
    session.add(db_hero)
    session.commit()
    session.refresh(db_hero)
    return db_hero
Here I am unable to understand this part
session.add(db_hero)
session.commit()
session.refresh(db_hero)
What is it doing and how is it working?
Couldn't understand this
In fact, you could think that all that block of code inside of the create_hero() function is still inside a with block for the session, because this is more or less what's happening behind the scenes.
But now, the with block is not explicitly in the function, but in the dependency above:
Here's the explanation from the docs of what a session is:
In the most general sense, the Session establishes all conversations with the database and represents a “holding zone” for all the objects which you’ve loaded or associated with it during its lifespan. It provides the interface where SELECT and other queries are made that will return and modify ORM-mapped objects. The ORM objects themselves are maintained inside the Session, inside a structure called the identity map - a data structure that maintains unique copies of each object, where “unique” means “only one object with a particular primary key”.
So
# This line simply creates a Python object
# that SQLAlchemy can "understand".
db_hero = Hero.from_orm(hero)

# This line adds the object `db_hero` to the “holding zone”.
session.add(db_hero)

# This line takes all objects from the “holding zone” and writes them
# to the database. In our case there is only one object in that zone,
# but there can be several.
session.commit()

# This line fetches the newly created row from the database and updates
# the object with it. That means the object can have new attributes, for
# example the id that the database assigned to the new row.
session.refresh(db_hero)
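For reference, the dependency that the quoted docs refer to looks roughly like this (a sketch; the engine URL is just a placeholder):
from sqlmodel import Session, create_engine

# Placeholder connection string; the tutorial uses a local SQLite file.
engine = create_engine("sqlite:///database.db")

def get_session():
    # The with block lives here: the Session is opened before the request
    # handler runs, handed to it via Depends(get_session), and closed again
    # once the handler returns.
    with Session(engine) as session:
        yield session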

Async behaviour within a with statement

I am building a wrapper for an API, in order to make it more accessible to our users. The user initialises the SomeAPI object and then has access to lots of class methods, as defined below.
One of the operations I wish to support is creating what we call an "instance".
Once the instance is no longer required, it should be deleted. Therefore I would use contextlib.contextmanager like so:
class SomeAPI:
    # Lots of methods
    ...
    ...

    def create_instance(self, some_id):
        # Create an instance for some_id
        payload = {"id": some_id}
        resp_url = ".../instance"
        # This specific line of code may take a long time
        resp = self.requests.post(resp_url, json=payload)
        return resp.json()["instance_id"]

    def delete_instance(self, instance_id):
        # Delete a specific instance
        resp_url = f".../instance/{instance_id}"
        resp = self.requests.delete(resp_url)
        return

    @contextlib.contextmanager
    def instance(self, some_id):
        instance_id = self.create_instance(some_id)
        try:
            yield instance_id
        finally:
            if instance_id:
                self.delete_instance(instance_id)
So then our users can write code like this
some_api = SomeApi()
# Necessary preprocessing - anywhere between 0-10 minutes
x = preprocess_1()
y = preprocess_2()
my_id = 1234
with some_api.instance(my_id):
# Once the instance is created, do some stuff with it in here
# Uses the preprocesses above
some_api.do_other_class_method_1(x)
some_api.do_other_class_method_2(y)
# Exited the with block - instance has been deleted
Which works fine. The problem is that creation of this instance always takes 60-90 seconds (as commented within the create_instance method), so if possible I would like to make this whole code more efficient by:
1. Start the process of creating the instance (using a with block)
2. Only then, start the preprocessing (as commented, this may take anywhere between 0-10 minutes)
3. Once the preprocessing has been completed, use its results with the instance
This order of operations would save up to 60 seconds each time, if the preprocessing happens to take more than 60 seconds. Note that there is no guarantee that the preprocessing will be longer or shorter than the creation of the instance.
I am aware of the existence of contextlib.asynccontextmanager, but the whole async side of things ties a knot in my brain. I have no idea how to get the order of operations right, while also maintaining the ability for the user to create and destroy the instance easily using a with statement.
Can anyone help?
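Not an asyncio answer, but one way to get this ordering with plain threads is to submit create_instance to a background thread and yield the Future, so the caller decides when to block; a rough sketch reusing the method names above (everything else is illustrative):
import concurrent.futures
import contextlib

class SomeAPI:
    ...  # create_instance / delete_instance as defined above

    @contextlib.contextmanager
    def instance(self, some_id):
        # Start creating the instance immediately in a background thread,
        # so the 60-90 second POST overlaps with the caller's preprocessing.
        with concurrent.futures.ThreadPoolExecutor(max_workers=1) as executor:
            future = executor.submit(self.create_instance, some_id)
            try:
                # Yield the Future; the caller calls .result() when it
                # actually needs the instance_id.
                yield future
            finally:
                # Wait for creation to finish (it may still be in flight),
                # then clean up, even if the with-body raised.
                instance_id = future.result()
                if instance_id:
                    self.delete_instance(instance_id)

# Usage: the preprocessing runs while the instance is being created.
some_api = SomeAPI()
with some_api.instance(1234) as pending:
    x = preprocess_1()
    y = preprocess_2()
    instance_id = pending.result()  # blocks only for whatever time is left
    some_api.do_other_class_method_1(x)
    some_api.do_other_class_method_2(y)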

How do you effectively cache a large Django queryset?

I am working on a Django project using a PostgreSQL database in which I need to run a sometimes complex query on a user profile table with ~1M records, and the dataset returned can be anywhere from 0 to all 1M records. My issue is that once I grab all of the records, I want to be able to filter them further to run analytics on these profiles. The analytics cannot be completed in the same request/response loop, as this will time out for all but the smallest querysets. So I am using async javascript to shoot off new requests for each type of analytics I want.
For example, I will grab all of the profiles in the initial request, and then I will send subsequent requests to take those profiles and return to me the % of genders or % of users who have a certain job title, etc. The issue is that every subsequent request has to run the original query again. What I would love to do is somehow cache that query and then run my query on this subset of the profile table without having to execute the original, long-running query.
I have tried to use a cache function to cache the queryset itself, but this actually slows down performance a bit, I assume because the queryset has to be pickled or unpickled and then it still has to run? I also tried to cache a list of ids from the parent long-running query (this is potentially a VERY long list, up to 1M integers), and that grinds my system to a complete halt for anything more than about 44k records.
Has anyone dealt with this kind of issue before? I know that I could set up a worker/queue system, which is on my roadmap, but it would be lovely if there was a simple solution to this that utilizes the built-in capabilities of Django.
Some sample code:
def get_analytics(request):
    data_type = request.POST.get('data_type')
    query_params = request.POST.get('query_params')  # a crazy filter with lots of Q objects
    profile_ids = get_profiles(query_params)  # I WANT TO CACHE THIS
    profiles = Profile.objects.filter(id__in=profile_ids).distinct()
    if data_type == 'overview':
        return profiles.count()
    elif data_type == 'gender':
        gender_breakdown = profiles.filter(a_filter_for_gender).values('gender').annotate(Count('gender', distinct=True))
        return gender_breakdown
def cache_function(length):
    """
    A variant of the snippet posted by Jeff Wheeler at
    http://www.djangosnippets.org/snippets/109/

    Caches a function, using the function and its arguments as the key, and the
    return value as the value saved. It passes all arguments on to the function,
    as it should.

    The decorator itself takes a length argument, which is the number of
    seconds the cache will keep the result around.
    """
    def decorator(func):
        def inner_func(*args, **kwargs):
            if hasattr(settings, 'IS_IN_UNITTEST'):
                return func(*args, **kwargs)
            key = get_cache_key(func.__name__, func.__module__, args, kwargs)
            value = cache.get(key)
            if key in cache:
                return value
            else:
                result = func(*args, **kwargs)
                cache.set(key, result, length)
                return result
        return inner_func
    return decorator
@cache_function(60*2)
def get_profiles(query_params):
    return Profile.objects.filter(query_params).values_list('id')
Why does caching the ids slow my system down? Is there a better way to accomplish this?
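For what it's worth, the id-caching variant is often written along these lines, keyed on a hash of the pickled filter and storing a flat id list (a sketch only; Profile is the model from the code above, and the cost of pickling/unpickling a very large id list is exactly the slowdown described):
import hashlib
import pickle

from django.core.cache import cache

def get_profile_ids(query_params, timeout=60 * 2):
    # Key the cache entry on a stable hash of the filter
    # (this assumes query_params is picklable).
    key = 'profile-ids-' + hashlib.md5(pickle.dumps(query_params)).hexdigest()
    ids = cache.get(key)
    if ids is None:
        ids = list(
            Profile.objects.filter(query_params).values_list('id', flat=True)
        )
        cache.set(key, ids, timeout)
    return ids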

ComputedProperty only updates on second put()

I have a ComputedProperty inside a StructuredProperty that does not get updated when the object is first created.
When I create the object address_components_ascii does not get saved. The field is not visible in the Datastore Viewer at all. But if I get() and then immediately put() again (even without changing anything), the ComputedProperty works as expected. The address_components field works properly.
I have tried clearing the database, and deleting the whole database folder, without success.
I am using the local dev server on windows 7. I have not tested it on GAE.
Here's the code:
class Item(ndb.Model):
    location = ndb.StructuredProperty(Location)
The inner Location class:
class Location(ndb.Model):
    address_components = ndb.StringProperty(repeated=True)  # array of names of parent areas, from smallest to largest
    address_components_ascii = ndb.ComputedProperty(lambda self: [normalize(part) for part in self.address_components], repeated=True)
The normalization function:
def normalize(s):
    return unicodedata.normalize('NFKD', s.decode("utf-8").lower()).encode('ASCII', 'ignore')
An example of the address_components field:
[u'114B', u'Drottninggatan', u'Norrmalm', u'Stockholm', u'Stockholm', u'Stockholms l\xe4n', u'Sverige']
and the address_components_ascii field, after the second put():
[u'114b', u'drottninggatan', u'norrmalm', u'stockholm', u'stockholm', u'stockholms lan', u'sverige']
The real problem seemed to be the order in which GAE calls _prepare_for_put() on the StructuredProperty relative to the call to _pre_put_hook() on the surrounding Model.
I was writing to address_components in the Item._pre_put_hook(). I assume GAE computed the ComputedProperty of the StructuredProperty before calling the _pre_put_hook() on Item. Reading from the ComputedProperty causes its value to be recalculated.
I added this to the end of the _pre_put_hook():
# quick-fix: ComputedProperty not getting generated properly
# read from all ComputedProperties, to compute them again before put
_ = self.address_components_ascii
I'm saving the return value to a dummy variable to avoid IDE warnings.
I just tried this code on the dev server and it worked. The computed property is accessible before and after put().
from google.appengine.ext import ndb

class TestLocation(ndb.Model):
    address = ndb.StringProperty(repeated=True)
    address_ascii = ndb.ComputedProperty(lambda self: [
        part.lower() for part in self.address], repeated=True)

class TestItem(ndb.Model):
    location = ndb.StructuredProperty(TestLocation)

item = TestItem(id='test', location=TestLocation(
    address=['Drottninggatan', 'Norrmalm']))
assert item.location.address_ascii == ['drottninggatan', 'norrmalm']
item.put()
assert TestItem.get_by_id('test').location.address_ascii == [
    'drottninggatan', 'norrmalm']
This seems to be a limitation in ndb. Simply doing a put() followed by a get() and another put() worked. It's slower, but it's only required when creating an object for the first time.
I added this method:
def double_put(self):
    return self.put().get().put()
which is a drop-in replacement for put().
When I put() a new object I call MyObject.double_put() instead of MyObject.put().

Django: how to do get_or_create() in a threadsafe way?

In my Django app very often I need to do something similar to get_or_create(). E.g.,
User submits a tag. Need to see if that tag already is in the database. If not, create a new record for it. If it is, just update the existing record.
But looking into the doc for get_or_create() it looks like it's not threadsafe. Thread A checks and finds Record X does not exist. Then Thread B checks and finds that Record X does not exist. Now both Thread A and Thread B will create a new Record X.
This must be a very common situation. How do I handle it in a threadsafe way?
Since 2013 or so, get_or_create is atomic, so it handles concurrency nicely:
This method is atomic assuming correct usage, correct database configuration, and correct behavior of the underlying database. However, if uniqueness is not enforced at the database level for the kwargs used in a get_or_create call (see unique or unique_together), this method is prone to a race-condition which can result in multiple rows with the same parameters being inserted simultaneously.

If you are using MySQL, be sure to use the READ COMMITTED isolation level rather than REPEATABLE READ (the default), otherwise you may see cases where get_or_create will raise an IntegrityError but the object won’t appear in a subsequent get() call.
From: https://docs.djangoproject.com/en/dev/ref/models/querysets/#get-or-create
Here's an example of how you could do it:
Define a model with either unique=True:
class MyModel(models.Model):
    slug = models.SlugField(max_length=255, unique=True)
    name = models.CharField(max_length=255)

MyModel.objects.get_or_create(slug=<user_slug_here>, defaults={"name": <user_name_here>})
... or by using unique_together:
class MyModel(models.Model):
    prefix = models.CharField(max_length=3)
    slug = models.SlugField(max_length=255)
    name = models.CharField(max_length=255)

    class Meta:
        unique_together = ("prefix", "slug")

MyModel.objects.get_or_create(prefix=<user_prefix_here>, slug=<user_slug_here>, defaults={"name": <user_name_here>})
Note how the non-unique fields are in the defaults dict, NOT among the unique fields in get_or_create. This will ensure your creates are atomic.
Here's how it's implemented in Django: https://github.com/django/django/blob/fd60e6c8878986a102f0125d9cdf61c717605cf1/django/db/models/query.py#L466 - Try creating an object, catch an eventual IntegrityError, and return the copy in that case. In other words: handle atomicity in the database.
This must be a very common situation. How do I handle it in a threadsafe way?
Yes.
The "standard" solution in SQL is to simply attempt to create the record. If it works, that's good. Keep going.
If an attempt to create a record gets a "duplicate" exception from the RDBMS, then do a SELECT and keep going.
Django, however, has an ORM layer with its own cache. So the logic is inverted to make the common case work directly and quickly, and let the uncommon case (the duplicate) raise a rare exception.
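In Django terms that pattern looks roughly like this (a sketch; Tag is a hypothetical model for the tag example in the question):
from django.db import IntegrityError, models, transaction

class Tag(models.Model):
    name = models.CharField(max_length=100, unique=True)

def get_or_create_tag(name):
    # Optimistically try to create; the unique constraint on `name` is what
    # makes this safe when two threads race.
    try:
        with transaction.atomic():
            return Tag.objects.create(name=name)
    except IntegrityError:
        # Someone else created it first: fetch and return that row instead.
        return Tag.objects.get(name=name)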
Try the transaction.commit_on_success decorator on the callable where you do the get_or_create(**kwargs):
"Use the commit_on_success decorator to use a single transaction for all the work done in a function. If the function returns successfully, then Django will commit all work done within the function at that point. If the function raises an exception, though, Django will roll back the transaction."
Apart from that, in concurrent calls to get_or_create, both threads try to get the object with the arguments passed to it (except for the "defaults" arg, which is a dict used during the create call in case get() fails to retrieve any object). If the get fails, both threads try to create the object, resulting in duplicate objects, unless a unique/unique_together constraint is enforced at the database level on the field(s) used in the get() call.
It is similar to this post:
How do I deal with this race condition in django?
So many years have passed, but nobody has written about threading.Lock. If you don't have the opportunity to add a unique_together constraint through migrations, for legacy reasons, you can use locks or threading.Semaphore objects. Here is the pseudocode:
from concurrent.futures import ThreadPoolExecutor
from threading import Lock

_lock = Lock()

def get_staff(data: dict):
    _lock.acquire()
    try:
        staff, created = MyModel.objects.get_or_create(**data)
        return staff
    finally:
        _lock.release()

with ThreadPoolExecutor(max_workers=50) as pool:
    pool.map(get_staff, get_list_of_some_data())
