Django: how to do get_or_create() in a threadsafe way? - python

In my Django app very often I need to do something similar to get_or_create(). E.g.,
User submits a tag. Need to see if that tag is already in the database. If not, create a new record for it. If it is, just update the existing record.
But looking at the docs for get_or_create(), it looks like it's not threadsafe. Thread A checks and finds Record X does not exist. Then Thread B checks and finds that Record X does not exist. Now both Thread A and Thread B will create a new Record X.
This must be a very common situation. How do I handle it in a threadsafe way?

Since 2013 or so, get_or_create is atomic, so it handles concurrency nicely:
This method is atomic assuming correct usage, correct database configuration, and correct behavior of the underlying database. However, if uniqueness is not enforced at the database level for the kwargs used in a get_or_create call (see unique or unique_together), this method is prone to a race condition which can result in multiple rows with the same parameters being inserted simultaneously.
If you are using MySQL, be sure to use the READ COMMITTED isolation level rather than REPEATABLE READ (the default), otherwise you may see cases where get_or_create will raise an IntegrityError but the object won't appear in a subsequent get() call.
From: https://docs.djangoproject.com/en/dev/ref/models/querysets/#get-or-create
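As an illustration of the MySQL advice, one way to request READ COMMITTED is via the database OPTIONS in settings.py. This is a sketch: the isolation_level option is supported by Django's MySQL backend in recent versions (verify against your release), and the database name below is hypothetical.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "mydb",  # hypothetical database name
        "OPTIONS": {"isolation_level": "read committed"},
    }
}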
Here's an example of how you could do it:
Define a model with either unique=True:
class MyModel(models.Model):
    slug = models.SlugField(max_length=255, unique=True)
    name = models.CharField(max_length=255)

MyModel.objects.get_or_create(slug=<user_slug_here>, defaults={"name": <user_name_here>})
... or by using unique_together:
class MyModel(models.Model):
    prefix = models.CharField(max_length=3)
    slug = models.SlugField(max_length=255)
    name = models.CharField(max_length=255)

    class Meta:
        unique_together = ("prefix", "slug")

MyModel.objects.get_or_create(prefix=<user_prefix_here>, slug=<user_slug_here>, defaults={"name": <user_name_here>})
Note how the non-unique fields are in the defaults dict, NOT among the unique fields in get_or_create. This will ensure your creates are atomic.
Here's how it's implemented in Django: https://github.com/django/django/blob/fd60e6c8878986a102f0125d9cdf61c717605cf1/django/db/models/query.py#L466 - try creating the object, catch the IntegrityError raised if a duplicate already exists, and return the existing copy in that case. In other words: handle atomicity in the database.
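For illustration, a minimal sketch of that idea (not Django's actual code, which handles more edge cases). It only works when the lookup kwargs are covered by a unique constraint, as explained above:
from django.db import IntegrityError, transaction

def get_or_create_sketch(model, defaults=None, **kwargs):
    # Try the cheap read first.
    try:
        return model.objects.get(**kwargs), False
    except model.DoesNotExist:
        try:
            with transaction.atomic():
                params = {**kwargs, **(defaults or {})}
                return model.objects.create(**params), True
        except IntegrityError:
            # A concurrent writer inserted the row first: fetch its copy.
            return model.objects.get(**kwargs), False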

This must be a very common situation. How do I handle it in a threadsafe way?
Yes.
The "standard" solution in SQL is to simply attempt to create the record. If it works, that's good. Keep going.
If an attempt to create a record gets a "duplicate" exception from the RDBMS, then do a SELECT and keep going.
Django, however, has an ORM layer with its own cache. So the logic is inverted: the common case works directly and quickly, and the uncommon case (the duplicate) raises a rare exception.

Try the transaction.commit_on_success decorator on the callable where you run get_or_create(**kwargs). (Note: commit_on_success was deprecated in Django 1.6 and removed in 1.8 in favor of transaction.atomic.)
"Use the commit_on_success decorator to use a single transaction for all the work done in a function. If the function returns successfully, then Django will commit all work done within the function at that point. If the function raises an exception, though, Django will roll back the transaction."
Apart from that, in concurrent calls to get_or_create both threads try to get the object with the arguments passed to it (except for the "defaults" arg, which is a dict used during the create call in case get() fails to retrieve any object). If both fail, both threads try to create the object, resulting in multiple duplicate objects unless a unique/unique_together constraint is enforced at the database level on the field(s) used in the get() call.
It is similar to this post: How do I deal with this race condition in django?

So many years have passed, but nobody has mentioned threading.Lock. If, for legacy reasons, you don't have the opportunity to add migrations for a unique constraint, you can use Lock or threading.Semaphore objects. Here is the pseudocode:
from concurrent.futures import ThreadPoolExecutor
from threading import Lock

_lock = Lock()

def get_staff(data: dict):
    # Serialize get_or_create calls; note this only guards threads
    # within this one process, not other processes or servers.
    with _lock:
        staff, created = MyModel.objects.get_or_create(**data)
        return staff

with ThreadPoolExecutor(max_workers=50) as pool:
    pool.map(get_staff, get_list_of_some_data())

Related

Run code when "foreign" object is added to set

I have a foreign key relationship in my Django (v3) models:
class Example(models.Model):
    title = models.CharField(max_length=200)  # this is irrelevant for the question here
    not_before = models.DateTimeField(auto_now_add=True)
    ...

class ExampleItem(models.Model):
    myParent = models.ForeignKey(Example, on_delete=models.CASCADE)
    execution_date = models.DateTimeField(auto_now_add=True)
    ....
Can I have code running/triggered whenever an ExampleItem is "added to the list of items in an Example instance"? What I would like to do is run some checks and, depending on the concrete Example instance possibly alter the ExampleItem before saving it.
To illustrate
Let's say the Example class's not_before date dictates that the ExampleItem's execution_date must not be before not_before. I would like to check whether the "to be saved" ExampleItem's execution_date violates this condition. If so, I would want to either change the execution_date to make it "valid" or throw an exception (whichever is easier). The same goes for a duplicate execution_date (i.e. if the respective Example already has an ExampleItem with the same execution_date).
So, in a view, I have code like the following:
def doit(request, example_id):
    # get the relevant `Example` object
    example = get_object_or_404(Example, pk=example_id)
    # create a new `ExampleItem`
    itm = ExampleItem()
    # set the item's parent
    itm.myParent = example  # <- this should trigger my validation code!
    itm.save()  # <- (or this???)
The thing is, this view is not the only way to create new ExampleItems; I also have an API, for example, that can do the same (let alone that a user could potentially add ExampleItems manually via the REPL). Preferably the validation code should not be duplicated in all the places where new ExampleItems can be created.
I was looking into Signals (Django docs), specifically pre_save and post_save (of ExampleItem), but I think pre_save is too early while post_save is too late... Also m2m_changed looks interesting, but I do not have a many-to-many relationship.
What would be the best/correct way to handle these requirements? They seem to be rather common, I imagine. Do I have to restructure my model?
The obvious solution here is to put this code in the ExampleItem.save() method - just beware that Model.save() is not invoked by some queryset bulk operations.
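For instance, here is a minimal sketch of that save() approach for the models above. It assumes execution_date is set explicitly rather than via auto_now_add (auto_now_add would overwrite the value on insert), and clamping is just one option; raising a ValidationError would work too:
from django.db import models

class ExampleItem(models.Model):
    myParent = models.ForeignKey(Example, on_delete=models.CASCADE)
    execution_date = models.DateTimeField()  # set explicitly, not auto_now_add

    def save(self, *args, **kwargs):
        # Clamp the date to the parent's lower bound instead of raising.
        if self.execution_date < self.myParent.not_before:
            self.execution_date = self.myParent.not_before
        super().save(*args, **kwargs)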
Using signal handlers on your own app's models is actually an antipattern; the goal of signals is to let your app hook into other apps' lifecycles without having to change those other apps' code.
Also (unrelated, but): you can populate your newly created model instances directly via their initializers, i.e.:
itm = ExampleItem(myParent=example)
itm.save()
and you can even save them directly:
# creates a new instance, populate it AND save it
itm = ExampleItem.objects.create(myParent=example)
This will still invoke your model's save method so it's safe for your use case.

modify `objects` to *always* return a subset of objects?

I have an Events table which contains various types of events. I care only about one of those types. As a result, every query I write begins with
Events.objects.filter(event_type="the_type").\
    etc(...).etc(...)
Obviously this is repetitive and easy to forget. Is there a way to use a custom Manager so that the objects attribute always returns a specific subset of the rows, without explicitly asking for it? Or any other way to restrict the model to a specific subset of the rows?
Yes, we can make a manager like:
from django.db import models
class EventManager(models.Manager):
    def get_queryset(self):
        return super(EventManager, self).get_queryset().filter(event_type="the_type")
and then add the manager to the Event class:
class Event(models.Model):
    # ...
    objects = EventManager()
Note however that some parts of Django will not use .objects but ._base_manager, and thus still operate on the entire set. Furthermore, my own experience with overriding the .objects manager is that it causes a lot of harm: for example, if you want to set an attribute on all events, writing Event.objects.all().update(foo='bar') will only update the events with the_type as type, whereas the code suggests otherwise.
Personally I think it is better to construct a manager with a different name, that at least hints that something is filtered, for example:
class Event(models.Model):
    # ...
    all_events = models.Manager()
    type_events = EventManager()
Here Event.objects no longer exists; instead you write Event.all_events or Event.type_events, and the code clearly hints at which subset you take.
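Usage then reads explicitly (the foo attribute and the filter below are hypothetical):
Event.all_events.update(foo='bar')              # really updates every event
Event.type_events.filter(name__icontains='x')   # only events of "the_type"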

Django cannot delete single object after rewriting model.Manager method

I am trying to override the get_by_natural_key method on a Django manager (models.Manager). After adding the model (NexchangeModel) I can delete objects in bulk with all(), but I cannot delete a single one.
Can:
SmsToken.objects.all().delete()
Cannot:
SmsToken.objects.last().delete()
Code:
from django.db import models
from django.contrib.auth.models import User
from core.common.models import SoftDeletableModel, TimeStampedModel, UniqueFieldMixin

class NexchangeManager(models.Manager):
    def get_by_natural_key(self, param):
        qs = self.get_queryset()
        lookup = {qs.model.NATURAL_KEY: param}
        return self.get(**lookup)

class NexchangeModel(models.Model):
    class Meta:
        abstract = True

    objects = NexchangeManager()

class SmsToken(NexchangeModel, SoftDeletableModel, UniqueFieldMixin):
    sms_token = models.CharField(max_length=4, blank=True)
    user = models.ForeignKey(User, related_name='sms_token')
    send_count = models.IntegerField(default=0)
While you are calling SmsToken.objects.all().delete(), you are calling the queryset's delete method. But with SmsToken.objects.last().delete() you are calling the instance's delete method.
Since Django 1.9 the queryset delete method returns the number of items deleted. From the release notes: "Changed in Django 1.9: The return value describing the number of objects deleted was added."
For the instance delete method, Django already knows only one row will be deleted. Also note that the queryset's delete method and the instance's delete method are different:
The delete() [on a queryset] method does a bulk delete and does not call any delete() methods on your models [the instance method]. It does, however, emit the pre_delete and post_delete signals for all deleted objects (including cascaded deletions).
So you cannot rely on the return value of the method to check whether the delete worked. But per Python's philosophy of "easier to ask forgiveness than permission", you can rely on exceptions to see whether a delete worked the way it should. Django's ORM will raise proper exceptions and perform proper rollbacks in case of any failure.
So you can do this:
try:
    instance.delete()  # or queryset.delete()
except Exception as e:
    # some code to notify of the failure, or re-raise to propagate it;
    # you can drop this try..except entirely if you want exceptions to propagate
    raise
Note: I am catching the generic Exception because the only code in my try block is the delete. If you have other code there, catch specific exceptions only.
I assume that SoftDeletableModel comes from the django-model-utils package? If so, the purpose of that model is to mark instances with an is_removed field rather than actually deleting them. So it's to be expected that calling delete() on a model instance—which is what you get from last()—wouldn't actually delete anything.
SoftDeletableModel provides an objects attribute with a manager that limits its results to non-removed objects, and overrides delete() to mark objects as removed instead of actually deleting them.
The problem is that you've defined your own manager as objects, so the SoftDeletableModel manager isn't being used. Your custom manager is actually bulk deleting objects from the database, contrary to the goal of doing a soft delete. The way to resolve this is to have your custom manager inherit from SoftDeletableManagerMixin:
class NexchangeManager(SoftDeletableManagerMixin, models.Manager):
    # your custom code
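A sketch of what that could look like, keeping the natural-key lookup from the question (this assumes django-model-utils, where SoftDeletableManagerMixin's get_queryset() excludes soft-removed rows):
from django.db import models
from model_utils.managers import SoftDeletableManagerMixin

class NexchangeManager(SoftDeletableManagerMixin, models.Manager):
    def get_by_natural_key(self, param):
        # Same lookup as before, but run against the soft-delete-aware queryset.
        lookup = {self.model.NATURAL_KEY: param}
        return self.get(**lookup)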

Django. Thread safe update or create.

We know that update() is a thread-safe operation.
It means, that when you do:
SomeModel.objects.filter(id=1).update(some_field=100)
Instead of:
sm = SomeModel.objects.get(id=1)
sm.some_field = 100
sm.save()
Your application is relatively thread-safe, and the operation SomeModel.objects.filter(id=1).update(some_field=100) will not overwrite data in other model fields.
My question is: is there any way to do
SomeModel.objects.filter(id=1).update(some_field=100)
but with creation of the object if it does not exist?
from django.db import IntegrityError

def update_or_create(model, filter_kwargs, update_kwargs):
    if not model.objects.filter(**filter_kwargs).update(**update_kwargs):
        kwargs = filter_kwargs.copy()
        kwargs.update(update_kwargs)
        try:
            model.objects.create(**kwargs)
        except IntegrityError:
            if not model.objects.filter(**filter_kwargs).update(**update_kwargs):
                raise  # re-raise IntegrityError
I think the code provided in the question is not very demonstrative: who wants to set the id on a model? Let's assume we need this, and that we have simultaneous operations:
def thread1():
    update_or_create(SomeModel, {'some_unique_field': 1}, {'some_field': 1})

def thread2():
    update_or_create(SomeModel, {'some_unique_field': 1}, {'some_field': 2})
With the update_or_create function, depending on which thread comes first, the object will be created and updated with no exception. This is thread-safe, but obviously of little use: depending on the race, the value of SomeModel.objects.get(some_unique_field=1).some_field could be 1 or 2.
Django provides F objects, so we can upgrade our code:
from django.db.models import F

def thread1():
    update_or_create(SomeModel,
                     {'some_unique_field': 1},
                     {'some_field': F('some_field') + 1})

def thread2():
    update_or_create(SomeModel,
                     {'some_unique_field': 1},
                     {'some_field': F('some_field') + 2})
You want django's select_for_update() method (and a backend that supports row-level locking, such as PostgreSQL) in combination with manual transaction management.
from django.db import IntegrityError, transaction

try:
    with transaction.commit_on_success():
        SomeModel.objects.create(pk=1, some_field=100)
except IntegrityError:  # unique id already exists, so update instead
    with transaction.commit_on_success():
        object = SomeModel.objects.select_for_update().get(pk=1)
        object.some_field = 100
        object.save()
Note that if some other process deletes the object between the two queries, you'll get a SomeModel.DoesNotExist exception.
Django 1.7 and above also has atomic operation support and a built-in update_or_create() method.
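With update_or_create the original example becomes a one-liner; defaults holds the fields to update (or to set on creation):
obj, created = SomeModel.objects.update_or_create(
    id=1,
    defaults={'some_field': 100},
)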
You can use Django's built-in get_or_create, but that operates on the model itself, rather than a queryset.
You can use that like this:
me, created = SomeModel.objects.get_or_create(id=1)  # get_or_create returns an (object, created) tuple
me.some_field = 100
me.save()
If you have multiple threads, your app will need to determine which instance of the model is correct. Usually what I do is refresh the model from the database, make changes, and then save it, so you don't have a long time in a disconnected state.
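That refresh-modify-save cycle might look like this (refresh_from_db exists since Django 1.8; update_fields keeps the write narrow):
me, created = SomeModel.objects.get_or_create(id=1)
me.refresh_from_db()                   # pick up any concurrent changes
me.some_field = 100
me.save(update_fields=['some_field'])  # write only the changed column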
It's impossible in Django to do such an upsert operation with update alone. But the queryset update method returns the number of rows it updated, so you can do:
from django.db import router, connections, transaction

class MySuperManager(models.Manager):
    def _lock_table(self, lock='ACCESS EXCLUSIVE'):
        cursor = connections[router.db_for_write(self.model)].cursor()
        cursor.execute(
            'LOCK TABLE %s IN %s MODE' % (self.model._meta.db_table, lock)
        )

    def create_or_update(self, id, **update_fields):
        with transaction.commit_on_success():
            self._lock_table()
            if not self.get_query_set().filter(id=id).update(**update_fields):
                self.model(id=id, **update_fields).save()
This example is for Postgres. You can use the manager without raw SQL, but then the update-or-insert operation will not be atomic; with a lock on the table you can be sure that two objects will not be created by two other threads.
I think if you have critical demands on atomic operations, you'd better design them at the database level instead of the Django ORM level.
The Django ORM focuses on convenience rather than performance and safety; you sometimes have to optimize the automatically generated SQL.
"Transactions" in most production databases provide database locks and rollback well.
In mashup (hybrid) systems, or when your system includes third-party components such as logging or statistics, applications in different frameworks or even languages may access the database at the same time; adding thread safety in Django alone is not enough in that case.
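As an illustration of pushing this down to the database, a hedged sketch using PostgreSQL's INSERT ... ON CONFLICT (PostgreSQL 9.5+; the table and column names for SomeModel below are hypothetical):
from django.db import connection

def upsert_some_field(pk, value):
    # A single atomic statement: the database resolves the race itself.
    with connection.cursor() as cursor:
        cursor.execute(
            """
            INSERT INTO myapp_somemodel (id, some_field)
            VALUES (%s, %s)
            ON CONFLICT (id) DO UPDATE SET some_field = EXCLUDED.some_field
            """,
            [pk, value],
        )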
SomeModel.objects.filter(id=1).update(set__some_field=100)

In django, how to delete all related objects when deleting a certain type of instances?

I first tried to override the delete() method, but that doesn't work with QuerySet's bulk delete method. It should be related to the pre_delete signal, but I can't figure it out. My code is as follows:
def _pre_delete_problem(sender, instance, **kwargs):
    instance.context.delete()
    instance.stat.delete()
But this method seems to be called infinitely and the program runs into a dead loop.
Can someone please help me?
If the class has foreign keys (or related objects), they are deleted by default, like a DELETE CASCADE in SQL. You can change that behavior with the on_delete argument when defining the ForeignKey, but the default is CASCADE.
You can check the docs here.
Now, the pre_delete signal works, but it doesn't call the delete() method if you are using a bulk delete, since a bulk delete doesn't operate on an object-by-object basis.
In your case, using the post_delete signal instead of pre_delete should fix the infinite loop: because a ForeignKey's on_delete defaults to CASCADE, the pre_delete logic triggers instance.context to delete instance, which in turn deletes instance.context again, and so forth.
Using this approach:
def _post_delete_problem(sender, instance, **kwargs):
    instance.context.delete()
    instance.stat.delete()

post_delete.connect(_post_delete_problem, sender=Foo)
Can do the cleanup you want.
If you'd like a quick one-off to delete an instance and all of its related objects and those related objects' objects and so on without having to change the DB schema, you can do this -
def recursive_delete(to_del):
    """Recursively delete an object, all of its protected related
    instances, those instances' protected instances, and so on.
    """
    from django.db.models import ProtectedError
    while True:
        try:
            to_del_pk = to_del.pk
            if to_del_pk is None:
                return  # unsaved object
            to_del.delete()
            # After a successful delete(), Django sets pk to None,
            # so the next loop iteration returns.
            print(f"Deleted {to_del.__class__.__name__} with pk {to_del_pk}: {to_del}")
        except ProtectedError as e:
            for protected_ob in e.protected_objects:
                recursive_delete(protected_ob)
Be careful, though!
I'd only use this to help with debugging in one-off scripts (or on the shell) with test databases that I don't mind wiping. Relationships aren't always obvious and if something is protected, it's probably for a reason.
