Is it possible that transaction.atomic does not work as expected? - python

This is a DRF API view for liking an entry. When someone likes an entry, I insert a like record into the table entry_like and increment the likes_num field in another table, entry, by 1. But something went wrong: for some entries, the count of entry_like records is less than the likes_num field in the entry table. I do not understand why it does not work as expected even though the post method is wrapped with the transaction.atomic decorator. Are there cases in which the transaction.atomic decorator does not behave as expected?

Yes, I think it is the case that transaction.atomic() does not work the way you expect.
To understand what it does, you have to understand SQL's transaction isolation levels and exactly what behavior they guarantee. You don't mention what database you're using, but PostgreSQL has good documentation on the subject.
Your expectation seems to be that it will work as if the isolation level was SERIALIZABLE. In fact, the default isolation level in Django is READ COMMITTED. And in that isolation level, if you have two of these transactions operating at once, they will both overwrite likes_num with the same number.
One solution is to use an F-object instead of setting likes_num to a specific value. In that case, the new value will be based on whatever value is in the field at the time of the write, rather than what value was in the field at the earlier point when you read the row.
entry.likes_num = F('likes_num') + 1
The other solution is to use select_for_update(), which will lock the entry row. It's better to avoid locks if you can, so I would opt for the F-object version.
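For illustration, here is a rough sketch of what both approaches could look like in the view. The Entry/EntryLike model and field names are placeholders based on the question's description, not your actual code:
from django.db import transaction
from django.db.models import F

# Approach 1: atomic increment with an F-expression (no row lock needed)
@transaction.atomic
def like_entry(entry_id, user_id):
    EntryLike.objects.create(entry_id=entry_id, user_id=user_id)
    Entry.objects.filter(pk=entry_id).update(likes_num=F('likes_num') + 1)

# Approach 2: lock the entry row with select_for_update, then read-modify-write
@transaction.atomic
def like_entry_locked(entry_id, user_id):
    entry = Entry.objects.select_for_update().get(pk=entry_id)
    EntryLike.objects.create(entry=entry, user_id=user_id)
    entry.likes_num += 1
    entry.save()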

I think you need to use F objects
from django.db.models import F
...
entry.likes_num = F('likes_num') + 1
entry.save()
You do not see any errors because, taken on its own, each transaction is valid; the two simply overwrite each other's result.

Related

When updating unpredictable columns out of many, how to keep the other columns with their current values?

Using Python and the built-in Sqlite:
When I use a prepared statement, not knowing what columns the user will want to update:
UPDATE_AUTHOR = """UPDATE authors
                   SET lastName=?, firstName=?, age=?, nationality=?
                   WHERE _id = ?
                """
How can I replace the '?' with some value that will keep the current value for some of the columns?
From the user perspective, I will tell him for example to press 'Enter' if he wishes to keep the current value. So for instance, he presses 'Enter' on lastName, updates firstName, updates age, and presses 'Enter' on nationality. And I replace his 'Enter' with, hopefully, a value that will keep the current value.
Is that possible? If not, how can I solve this problem differently but efficiently?
I thought about building the prepared statement dynamically: in the above example, adding firstName=? and age=? after UPDATE authors SET, and then the rest of the statement, WHERE _id = ?. But this seems less convenient and less organized.
There are two ways of handling this. One is to build a specific UPDATE query containing only the fields that will change. As you have said, it is less convenient because both the query and the parameter list have to be tweaked.
The other way is to always update all the columns, but keep the saved values for those that should not change. This is a common design in user interfaces:
the user is presented with all the values for an object and can change some of them;
if they confirm their choice, the application takes all the values, changed or not, and uses them in an UPDATE query.
In any case, it is common to read all the values before changing some of them, so this is not necessarily expensive. At the database level, updating one or many columns generally has almost the same cost: a record is loaded from disk (or cache), some values are changed (the cheapest part), and the record is written back to disk. Even with the database caches, the most expensive part in the databases I know of is loading and saving the record.
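As a rough sketch of that second approach with Python's built-in sqlite3 (the authors table layout is the one from the question; None stands for the user pressing 'Enter'):
import sqlite3

UPDATE_AUTHOR = """UPDATE authors
                   SET lastName = ?, firstName = ?, age = ?, nationality = ?
                   WHERE _id = ?"""

def update_author(conn, author_id, last_name=None, first_name=None,
                  age=None, nationality=None):
    # Read the current values first.
    row = conn.execute(
        "SELECT lastName, firstName, age, nationality FROM authors WHERE _id = ?",
        (author_id,)).fetchone()
    if row is None:
        return  # no such author
    current = dict(zip(("lastName", "firstName", "age", "nationality"), row))
    # Keep the stored value for every field the user left unchanged.
    values = (
        last_name if last_name is not None else current["lastName"],
        first_name if first_name is not None else current["firstName"],
        age if age is not None else current["age"],
        nationality if nationality is not None else current["nationality"],
        author_id,
    )
    conn.execute(UPDATE_AUTHOR, values)
    conn.commit()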

Self-excluding execution of functions in Django

I have a Django model that stores values in a JSON field.
Some of the values have to be unique; for that I have a function check_unique().
But this check fails if two users try to save the same value at the same time, because when check_unique() runs neither of the values is stored in the database yet, so each of them looks correct on its own.
Is there a way to avoid this behavior?
I tried avoiding it with threading.Lock, but Apache runs several processes, so it doesn't work in that case.
Besides, I would like the check to be at the application level (in Python) and not at the database level.
The code looks like this:
semaphore.claim()
try:
    uniques = check_unique(self.answers)
    if not uniques:
        self.go_save()
    semaphore.release()
    return Response("All OK")
except Exception as e:
    semaphore.release()
    return e
Have you looked at atomic requests in Django? https://docs.djangoproject.com/en/1.11/topics/db/transactions/#django.db.transaction.atomic
As far as I know, if you make that database transaction atomic, the second user will not be able to save the value. What atomic basically does is try to save the values to the database, and when it detects an exception it rolls those changes back.
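A rough sketch of what that could look like, using the names from the question (check_unique and go_save are the question's own helpers; whether atomic alone is enough also depends on your database's isolation level):
from django.db import transaction
from rest_framework.response import Response

def post(self, request):
    try:
        with transaction.atomic():
            if not check_unique(self.answers):
                self.go_save()
        return Response("All OK")
    except Exception as e:
        # Everything done inside the atomic block has been rolled back here.
        return Response(str(e), status=400)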
Another approach would be to create a custom validator that checks the value twice: once to see whether the value already exists in the database, and again x seconds later (randomize the delay so both users cannot validate at exactly the same time). That way, if user 1 has already saved the value to the database, the second check will see it and cancel the operation.
Finally, check out https://github.com/alecthomas/voluptuous. Maybe it does what you need!
Hope that helps.

Getting a list of results, 1 for each foreign key

I have a model, Reading, which has a foreign key, Type. I'm trying to get a reading for each type that I have, using the following code:
for type in Type.objects.all():
    readings = Reading.objects.filter(type=type.pk)
    if readings.exists():
        reading_list.append(readings[0])
The problem with this, of course, is that it hits the database for each sensor reading. I've played around with some queries to try to optimize this to a single database call, but none of them seem efficient. .values for instance will provide me a list of readings grouped by type, but it will give me EVERY reading for each type, and I have to filter them with Python in memory. This is out of the question, as we're dealing with potentially millions of readings.
If you use PostgreSQL as your DB backend you can do this in one line with something like:
Reading.objects.order_by('type__pk', 'any_other_order_field').distinct('type__pk')
Note that the field on which distinct happens must always be the first argument in the order_by method. Feel free to change type__pk to the actual field you want to order types on (e.g. type__name if the Type model has a name property). You can read more about distinct here: https://docs.djangoproject.com/en/dev/ref/models/querysets/#distinct.
If you do not use PostgreSQL you could use the prefetch_related method for this purpose:
# reading_set could be replaced with whatever your reverse relation name actually is
for type in Type.objects.prefetch_related('reading_set').all():
    readings = type.reading_set.all()
    if len(readings):
        reading_list.append(readings[0])
The above will perform only 2 queries in total. Note that I use len() so that no extra query is performed when counting the objects. You can read more about prefetch_related here: https://docs.djangoproject.com/en/dev/ref/models/querysets/#prefetch-related.
The downside of this approach is that you first retrieve all related objects from the DB and then only use the first one.
The above code is not tested, but I hope it at least points you in the right direction.

Strongly consistent queries for root entities in GAE?

I'd like some advice on the best way to do a strongly consistent read/write in Google App Engine.
My data is stored in a class like this.
class UserGroupData(ndb.Model):
    users_in_group = ndb.StringProperty(repeated=True)
    data = ndb.StringProperty(repeated=True)
I want to write a safe update method for this data. As far as I understand, I need to avoid eventually consistent reads here, because they risk data loss. For example, the following code is unsafe because it uses a vanilla query which is eventually consistent:
def update_data(user_id, additional_data):
    entity = UserGroupData.query(UserGroupData.users_in_group == user_id).get()
    entity.data.append(additional_data)
    entity.put()
If the entity returned by the query is stale, data is lost.
In order to achieve strong consistency, it seems I have a couple of different options. I'd like to know which option is best:
Option 1:
Use get_by_id(), which is always strongly consistent. However, there doesn't seem to be a neat way to do this here. There isn't a clean way to derive the key for UserGroupData directly from a user_id, because the relationship is many-to-one. It also seems kind of brittle and risky to require my external clients to store and send the key for UserGroupData.
Option 2:
Place my entities in an ancestor group, and perform an ancestor query. Something like:
def update_data(user_id, additional_data):
    entity = UserGroupData.query(
        UserGroupData.users_in_group == user_id,
        ancestor=ancestor_for_all_ugd_entities()).get()
    entity.data.append(additional_data)
    entity.put()
I think this should work, but putting all UserGroupData entities into a single ancestor group seems like an extreme thing to do. It results in writes being limited to ~1/sec. This seems like the wrong approach, since each UserGroupData is actually logically independent.
Really what I'd like to do is perform a strongly consistent query for a root entity. Is there some way to do this? I noticed a suggestion in another answer to essentially shard the ancestor group. Is this the best that can be done?
Option 3:
A third option is to do a keys_only query followed by get_by_id(), like so:
def update_data(user_id, additional_data):
    entity_key = UserGroupData.query(
        UserGroupData.users_in_group == user_id).get(keys_only=True)
    entity = entity_key.get()
    entity.data.append(additional_data)
    entity.put()
As far as I can see this method is safe from data loss, since my keys are not changing and the get() gives strongly consistent results. However, I haven't seen this approach mentioned anywhere. Is this a reasonable thing to do? Does it have any downsides I need to understand?
I think you are also conflating the issue of inconsistent queries with safe updates of the data.
A query like the one in your example UserGroupData.query(UserGroupData.users_in_group==user_id).get() will always only return one entity, if the user_id is in the group.
If it has only just been added and the index is not up to date then you won't get a record and therefore you won't update the record.
Any update irrespective of the method of fetching the entity should be performed inside a transaction ensuring update consistency.
As to ancestors improving the consistency of the query: it's not clear they help if you plan to have multiple UserGroupData entities, and in that case why are you doing a get()?
So option 3 is probably your best bet: do the keys-only query, then inside a transaction do the Key.get() and the update (see the sketch after the list below). Remember that cross-group transactions are limited to 5 entity groups.
Given this approach, if the index the query is based on is out of date, then one of three things can happen:
the record you want isn't found, because the newly added user id is not yet reflected in the index;
the record you want is found, and the get() fetches it consistently;
the record you want is found, but the user id has actually been removed and the index is simply out of date; the get() retrieves the entity consistently and the user id is not present.
Your code can then decide what course of action to take.
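A minimal sketch of that keys-only-query-then-transactional-update pattern, assuming the UserGroupData model from the question (handling of the "not found" case is only indicated):
from google.appengine.ext import ndb

@ndb.transactional
def _append_data(key, additional_data):
    entity = key.get()  # reads inside a transaction are strongly consistent
    entity.data.append(additional_data)
    entity.put()

def update_data(user_id, additional_data):
    # The keys-only query still relies on an eventually consistent index.
    key = UserGroupData.query(
        UserGroupData.users_in_group == user_id).get(keys_only=True)
    if key is None:
        return  # index not yet updated, or the user is in no group
    _append_data(key, additional_data)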
What is the use case for querying all the UserGroupData entities that a particular user is a member of that would require such updates?

How to explicitly define the query used in subqueryload_all?

I'm using subqueryload/subqueryload_all pretty heavily, and I've run into the edge case where I tend to need to very explicitly define the query that is used during the subqueryload. For example I have a situation where I have posts and comments. My query looks something like this:
posts_q = db.query(Post).options(subqueryload(Post.comments))
As you can see, I'm loading each Post's comments. The problem is that I don't want all of the posts' comments: I also need to take into account a deleted field, and they need to be ordered by create time descending. The only way I have observed this being done is by adding options to the relationship() declaration between posts and comments. I would prefer not to do this, because it means that relationship cannot be reused everywhere after that, as I have other places in the app where those constraints may not apply.
What I would love to do, is explicitly define the query that subqueryload/subqueryload_all uses to load the posts' comments. I read about DisjointedEagerLoading here, and it looks like I could simply define a special function that takes in the base query, and a query to load the specified relationship. Is this a good route to take for this situation? Anyone ever run into this edge case before?
The answer is that you can define multiple relationships between Posts and Comments:
class Post(...):
    # String expressions are evaluated lazily, so Post can be referenced inside
    # its own class body; note primaryjoin (no underscore) and == comparisons.
    active_comments = relationship(
        Comment,
        primaryjoin="and_(Comment.post_id == Post.post_id, Comment.deleted == False)",
        order_by="Comment.created.desc()")
Then you should be able to subqueryload by that relationship:
posts_q = db.query(Post).options(subqueryload(Post.active_comments))
You can still use the existing .comments relationship elsewhere.
I also had this problem and it took me some time to realize that this is an issue by design. When you say Post.comments, you refer to the relationship that says "these are all the comments of that post". However, now you want to filter them. If you specify that condition somewhere on subqueryload, then you are essentially loading only a subset of values into Post.comments. Thus, there will be values missing, and you have a faulty representation of your data in the model.
The question here is how to approach this, because you obviously need these values somewhere. The way I go is to build the subquery myself and specify the special conditions there. That means you get two objects back: the list of posts and the list of comments. It is not a pretty solution, but at least it does not display the data in a wrong way. If you were to access Post.comments for some reason, you can safely assume it contains all the comments.
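A minimal sketch of that two-query approach, using the Post/Comment names and the deleted/created columns assumed earlier in this thread:
from collections import defaultdict

posts = db.query(Post).all()
post_ids = [p.post_id for p in posts]

active_comments = (
    db.query(Comment)
    .filter(Comment.post_id.in_(post_ids), Comment.deleted == False)
    .order_by(Comment.created.desc())
    .all())

# Group the comments by post id so they can be attached to the posts manually,
# e.g. comments_by_post.get(post.post_id, []).
comments_by_post = defaultdict(list)
for comment in active_comments:
    comments_by_post[comment.post_id].append(comment)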
But there is room for improvement: you might want to have this attached to your class so you don't carry around two variables. The easy way might be to define a second relationship, e.g. published_comments, which specifies extra parameters. You could then also control that no one writes to it, e.g. with attribute events. In those events you could, instead of forbidding manipulation, define how manipulation is allowed. The only problem might be when updates happen: if you add a comment to Post.comments, published_comments won't be updated automatically because the two are not aware of each other. Again, I'd use events for this if it is a required feature (though with the ugly solution above you would not have that either).
As a last, hybrid, solution you could take the first approach and then just assign those values to your object, e.g. Post.deleted_comments = deleted_comments.
The thing to keep in mind here is that it is generally not a clever idea to manipulate the queries the ORM makes, as this can lead to problems later on. I have taken this approach and manipulated the queries (with contains_eager this is easily possible), but it created problems at some points (while generally being functional), so I dropped it.
