Reading this documentation https://docs.djangoproject.com/en/4.0/topics/db/transactions/#django.db.transaction.on_commit
This is the use case for on_commit:
with transaction.atomic():  # Outer atomic, start a new transaction
    transaction.on_commit(foo)
    # Do things...
    with transaction.atomic():  # Inner atomic block, create a savepoint
        transaction.on_commit(bar)
        # Do more things...
# foo() and then bar() will be called when leaving the outermost block
But why not just write the code like normal without on_commit hooks? Like this:
with transaction.atomic():  # Outer atomic, start a new transaction
    # Do things...
    with transaction.atomic():  # Inner atomic block, create a savepoint
        # Do more things...
    foo()
    bar()
# foo() and then bar() will be called when leaving the outermost block
It's easier to read, since it doesn't require knowledge of any extra Django APIs and the statements appear in the order they are executed. It's easier to test, since you don't have to use any special Django test classes.
So what is the use-case for the on_commit hook?
The example code given in the Django docs is transaction.on_commit(lambda: some_celery_task.delay('arg1')), and that's probably because this comes up a lot with Celery tasks.
Imagine if you do the following within a transaction:
my_object = MyObject.objects.create()
some_celery_task.delay(my_object.pk)
Then in your celery task you try doing this:
@app.task
def some_celery_task(object_pk):
    my_object = MyObject.objects.get(pk=object_pk)
This may work a lot of the time, but you'll randomly get errors where it can't find the object (depending on how quickly the worker picks up the task, since it's a race condition). This is because you created the MyObject record within a transaction, but it isn't actually available in the database until a COMMIT is run. Celery has no access to that open transaction, so the task needs to run after the COMMIT. There's also the very real possibility that something later on causes a ROLLBACK, in which case that Celery task should never be called at all.
So... You need to do:
my_object = MyObject.objects.create()
transaction.on_commit(lambda: some_celery_task.delay(my_object.pk))
Now, the celery task won't be called until the MyObject has actually been saved to the database after the COMMIT was called.
I should note, though, that this is primarily a concern when you aren't running in AUTOCOMMIT mode (which is actually the default). In AUTOCOMMIT mode you can be certain that a commit has finished as part of a .create() or .save(). However, if your code base has any possibility of being called within a transaction.atomic() block (or under the @transaction.atomic decorator), then it's no longer AUTOCOMMIT and you're back to needing on_commit(), so it's best/safest to always use it.
Django documentation:
Django provides the on_commit() function to register callback functions that should be executed after a transaction is successfully committed.
That is the main purpose. A transaction is a unit of work that you want to treat atomically: it either happens completely or not at all. The same applies to your code; if something goes wrong during the DB operations, you might not need to do some of the follow-up things.
Let's consider a business logic flow:
1. The user sends his registration data to our endpoint; we validate it, etc.
2. We save the new user to our DB.
3. We send him a "hello" email with a link for confirming his account.
If something goes wrong during step 2, we shouldn't go on to step 3 (see the sketch below).
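Here is a minimal sketch of that flow with on_commit, on a Django version that has it; the view shape and the send_welcome_email helper are hypothetical, not from the original post:

from django.contrib.auth.models import User
from django.db import transaction

def register(request):
    # step 1: validate request data (elided)
    with transaction.atomic():
        user = User.objects.create_user(  # step 2
            username=request.POST["username"],
            password=request.POST["password"],
        )
        # step 3 runs only after a successful COMMIT; on ROLLBACK the
        # callback is discarded and no email goes out
        # (send_welcome_email is a hypothetical helper)
        transaction.on_commit(lambda: send_welcome_email(user.pk))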
You might think: well, I'll get an exception and that code won't execute anyway, so why do we still need it?
Sometimes you take actions in your code based on the assumption that the transaction will succeed, before the potentially dangerous DB operations. For example, you may first want to check whether you can send an email to your user, because you know that your third-party email provider often returns 500s. In that case, you'd raise a 500 for the user and ask him to register later (a very bad idea, by the way, but it's just a synthetic example).
When your function (e.g. one with the @atomic decorator) contains a lot of DB operations, you surely don't want to track the state of every variable just to use it after all the DB-related code. Consider a flow like this (a sketch follows the list):
1. Validate the user's order.
2. Check in the DB whether it can be completed.
3. If it can be done, send a request to a third-party CRM with the order's details.
4. If it can't, create a support ticket in another third party.
5. Save the user's order to the DB and update the user's model.
6. Send a messenger notification to the employee responsible for the order.
7. Save to the DB that the notification for the employee was sent successfully.
You can imagine what a mess we would have in this situation without on_commit: one really big try/except around everything.
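A rough sketch of how on_commit lets you register each side effect right where the decision is made, instead of carrying state to the end of the function (all the helper names here are hypothetical):

from django.db import transaction

@transaction.atomic
def process_order(user, order_data):
    order = validate_order(user, order_data)    # step 1
    if can_be_completed(order):                 # step 2
        # step 3: fires only if the whole transaction commits
        transaction.on_commit(lambda: send_to_crm(order.pk))
    else:
        # step 4: same guarantee for the failure path
        transaction.on_commit(lambda: create_support_ticket(order.pk))
    order.save()                                # step 5
    user.save()
    # step 6; step 7 (recording that the notification was sent) would
    # live inside notify_employee, after the message actually goes out
    transaction.on_commit(lambda: notify_employee(order.pk))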
Related
My (CLI) app uses SQLAlchemy 1.3. One of the jobs it has to perform is to query for a large number of records (>300k), then do some calculations on those and insert new records based on the results of the processing.
The app also writes its activities into a log table, so I can see what it is currently doing during that long-running job. I have various pipelines (tasks), so there is a "pipelines" table with a 1:many relationship to a "log_messages" table.
I am using the ORM style, omitting the model classes here. I think it's not relevant but let me know if I should add more details.
So the general flow is something like this:
def perform_task():
    with session_scope() as session:
        # get a pipeline record for our log messages to link to
        pipeline = session.query(Pipelines).filter(Pipelines.name == 'some_name').one()
        # log the start of the work
        pipeline.log_messages.append(LogMessage(text="started work"))
        # query the records we are working on (>300k)
        job_input_all = session.query(SomeModel).filter(SomeModel.is_of_interest == True).all()
        for job_input in job_input_all:
            # "results" stands in for whatever relationship collects the output
            job_input.results.append(
                SomeOtherModel(something_calculated=_do_calculation(job_input, pipeline)))
        pipeline.log_messages.append(LogMessage(text="finished work"))
def _do_calculation(job_input, pipeline):
    # of course this isn't the real calculation, just illustrating that
    # "something happens here"; the real stuff is complex, takes a lot of
    # time to compute, and we need to write log messages from time to time
    calculated_value = job_input.value * 1000
    if calculated_value > 100000:
        pipeline.log_messages.append(LogMessage(
            text=f'input value {job_input.value} resulted in bad output {calculated_value}'))
    return calculated_value
If I do it that way, none of the log messages will appear until the session scope ends, which commits everything. As this job takes a long time, it is important that I get the logs updated in real time so I can see what is going on. How would I do this?
If I commit after each pipeline log message is created, I will invalidate (and force to re-query) the row result objects in job_input_all, which would be bad. Even more problematic: I can't commit logs in _do_calculation() because I don't want to commit all the calculated stuff yet.
I have worked with ORMs in other languages before but I am new to SQLA (and Python for that matter) so I'm probably missing something fundamental here. Thanks for your help!
My advice would be to not write logs in this manner. If you instead wrote logs to an ELK (Elasticsearch/Logstash/Kibana) stack you could make it independent of your session and have all the nice inbuilt log-related features that Kibana gives you out of the box, in an easy to use web-GUI.
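For example, a minimal sketch using only the standard library: write log lines to a plain file and let a shipper such as Filebeat or Logstash forward them to Elasticsearch, so log visibility never depends on a DB commit (the path and logger name are illustrative):

import logging

logger = logging.getLogger("pipeline")
handler = logging.FileHandler("/var/log/myapp/pipeline.log")
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s %(name)s %(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

def _do_calculation(job_input, pipeline):
    calculated_value = job_input.value * 1000
    if calculated_value > 100000:
        # visible immediately, independent of the SQLAlchemy session
        logger.warning("input value %s resulted in bad output %s",
                       job_input.value, calculated_value)
    return calculated_value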
I've got a simple webservice which uses SQLAlchemy to connect to a database using the pattern
engine = create_engine(database_uri)
connection = engine.connect()
In each endpoint of the service, I then use the same connection, in the following fashion:
for result in connection.execute(query):
    <do something fancy>
Since Sessions are not thread-safe, I'm afraid that connections aren't either.
Can I safely keep doing this? If not, what's the easiest way to fix it?
Minor note -- I don't know if the service will ever run multithreaded, but I'd rather be sure that I don't get into trouble when it does.
Short answer: you should be fine.
There is a difference between a connection and a Session. The short description is that a connection represents just that: a connection to a database. Information you pass into it will come out pretty much as-is. It won't keep track of your transactions unless you tell it to, and it won't care about what order you send it data. So if it matters that you create your Widget object before your Sprocket object, then you had better call that in a thread-safe context. The same generally goes for keeping track of a database transaction.
Session, on the other hand, keeps track of data and transactions for you. If you check out the source code, you'll notice quite a bit of back and forth over database transactions; without a way to know that you have everything you want in a transaction, you could very well end up committing in one thread while you expect to be able to add another object (or several) in another.
In case you don't know what a transaction is, here's the Wikipedia article, but the short version is that transactions help keep your data consistent. If you have 15 inserts and updates and the 15th fails, you might not want to keep the other 14. A transaction lets you cancel the entire operation in bulk.
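If you'd rather not rely on sharing one connection, a simple defensive sketch is to keep the shared Engine (which is thread-safe) and check out a short-lived connection per call; the query text below is a placeholder:

from sqlalchemy import create_engine, text

engine = create_engine(database_uri)  # safe to share across threads

def handle_request():
    # each call borrows its own connection from the engine's pool
    with engine.connect() as connection:
        for result in connection.execute(text("SELECT * FROM widgets")):
            ...  # <do something fancy>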
How can I get a task by name?
from google.appengine.api import taskqueue

taskqueue.add(name='foobar', url='/some-handler', params={'foo': 'bar'})

task_queue = taskqueue.Queue('default')
task_queue.delete_tasks_by_name('foobar')  # would work

# looking for a method like this:
foobar_task = task_queue.get_task_by_name('foobar')
It should be possible with the REST API (https://developers.google.com/appengine/docs/python/taskqueue/rest/tasks/get). But I would prefer something like task_queue.get_task_by_name('foobar'). Any ideas? Did I miss something?
There is no guarantee that the task with this name exists - it may have been already executed. And even if you manage to get a task, it may be executed while you are trying to do something with it. So when you try to put it back, you have no idea if you are adding it for the first time or for the second time.
Because of this uncertainty, I can't see any use case where getting a task by name may be useful.
EDIT:
You can give a name to your task in order to ensure that a particular task only executes once. When you add a task with a name to a queue, App Engine will check if the task with such name already exists. If it does, the subsequent attempt will fail.
For example, you can have many instances running, and each instance may need to insert an entity into the Datastore. Your first option is to check whether the entity already exists in the Datastore. This is a relatively slow operation, and by the time you receive your response (entity does not exist) and decide to insert it, another instance could have already inserted it. So you end up with two entities instead of one.
Your second option is to use tasks. Instead of inserting a new entity directly into the Datastore, an instance creates a named task to insert it. If another instance tries to add a task with the same name, the add simply fails, so the insert is enqueued only once. As a result, you are guaranteed that the entity will be inserted only once.
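A sketch of that pattern (the handler URL and name scheme are illustrative); note that on the Python runtime the duplicate add raises rather than silently succeeding:

from google.appengine.api import taskqueue

def enqueue_insert(entity_key):
    try:
        taskqueue.add(name='insert-%s' % entity_key,
                      url='/tasks/insert-entity',
                      params={'key': entity_key})
    except (taskqueue.TaskAlreadyExistsError,
            taskqueue.TombstonedTaskError):
        # another instance got there first; the entity will be
        # inserted exactly once, so there is nothing left to do
        pass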
I have tens (potentially hundreds) of thousands of persistent objects that I want to generate in a multithreaded fashion, due to the processing required.
While the creation of the objects happens in separate threads (using Flask-SQLAlchemy extension btw with scoped sessions) the call to write the generated objects to the DB happens in 1 place after the generation has completed.
The problem, I believe, is that the objects being created are part of several existing relationships-- thereby triggering the automatic addition to the identity map despite being created in separate, concurrent, threads with no explicit session in any of the threads.
I was hoping to contain the generated objects in a single list, and then write the whole list (using a single session object) to the database. This results in an error like this:
AssertionError: A conflicting state is already present in the identity map for key (<class 'app.ModelObject'>, (1L,))
Hence I believe the identity map has already been populated: the assertion error is triggered when I try to add and commit using the global session outside of the concurrent code.
The final detail is that I can't find a way to get a reference to whatever session object(s) are involved (scoped or otherwise; I don't fully understand how automatic addition to the identity map works in the multithreaded case), so even if I wanted to deal with a separate session per process, I couldn't.
Any advice is greatly appreciated. The only reason I am not posting code (yet) is because it's difficult to abstract a working example immediately out of my app. I will post if somebody really needs to see it though.
Each session is thread-local; in other words there is a separate session for each thread. If you decide to pass some instances to another thread, they will become "detached" from the session. Use db.session.add_all(objects) in the receiving thread to put them all back.
For some reason, it looks like you're creating objects with the same identity (primary key columns) in different threads, then trying to send them both to the database. One option is to fix why this is happening, so that identities will be guaranteed unique. You may also try merging; merged_object = db.session.merge(other_object, load=False).
Edit: zzzeek's comment clued me in on something else that may be going on:
With Flask-SQLAlchemy, the session is tied to the app context. Since that is thread local, spawning a new thread will invalidate the context; there will be no database session in the threads. All the instances are detached there, and cannot properly track relationships. One solution is to pass app to each thread and perform everything within a with app.app_context(): block. Inside the block, first use db.session.add to populate the local session with the passed instances. You should still merge in the master task afterwards to ensure consistency.
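A rough sketch of that pattern, assuming app and db are your Flask app and Flask-SQLAlchemy handle (the worker function and instance list are hypothetical):

import threading

def worker(app, instances):
    # the thread gets its own app context, and therefore its own session
    with app.app_context():
        db.session.add_all(instances)  # attach the passed instances
        ...  # do the actual work with a live session

thread = threading.Thread(target=worker, args=(app, my_instances))
thread.start()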
I just want to clarify the problem and the solution with some pseudo-code in case somebody has this problem / wants to do this in the future.
class ObjA(Base):
    obj_c = relationship('ObjC', backref='obj_a')

class ObjB(Base):
    obj_c = relationship('ObjC', backref='obj_b')

class ObjC(Base):
    obj_a_id = Column(Integer, ForeignKey('obj_a.id'))
    obj_b_id = Column(Integer, ForeignKey('obj_b.id'))

    def __init__(self, obj_a, obj_b):
        self.obj_a = obj_a
        self.obj_b = obj_b

def make_a_bunch_of_c(obj_a, list_of_b=None):
    return [ObjC(obj_a, obj_b) for obj_b in list_of_b]

def parallel_generate():
    list_of_a = session.query(ObjA).all()  # assume there are 1000 of these
    list_of_b = session.query(ObjB).all()  # and 30 of these
    fxn = functools.partial(make_a_bunch_of_c, list_of_b=list_of_b)
    pool = multiprocessing.Pool(10)
    # pool.map returns one list of ObjC per ObjA, so flatten the result
    all_the_things = [c for chunk in pool.map(fxn, list_of_a) for c in chunk]
    return all_the_things
Now let's stop here a second. The original problem was that attempting to ADD the list of ObjC's caused the error message in the original question:
session.add_all(all_the_things)
AssertionError: A conflicting state is already present in the identity map for key [...]
Note: the error occurs during the adding phase; the commit attempt never even happens, because the assertion fires pre-commit. As far as I could tell.
Solution:
all_the_things = parallel_generate()

for thing in all_the_things:
    session.merge(thing)

session.commit()
The details of session utilization when dealing with automatically added objects (via relationship cascading) are still beyond me, and I can't explain why the conflict originally occurred. All I know is that using merge() causes SQLAlchemy to sort all of the child objects created across 10 different processes into the single session in the master process.
I would be curious in the why, if anyone happens across this.
So this isn't necessarily a Django question, I'm just having a mental block getting my head around the logic, but I suppose Django might provide some ways to manually lock records that would be helpful.
Essentially, a user may upload one or many files at a time. Each file is uploaded via a separate request. When the user goes above 90% storage quota, I'd like to send an email to them notifying them as such, but I only want to send a single email. So my current workflow is to check their usage, make sure they have not yet been sent a reminder, and :
if usage_pct >= settings.VDISK_HIGH_USAGE_THRESHOLD and disk.last_high_usage_reminder is None:
    disk.last_high_usage_reminder = timezone.now()
    disk.save()
    vdisks_high_usage_send_notice(user)
The above code, however, often lets more than one email through. So my first thought is to somehow lock the disk record before even checking the value, and then unlock it after saving. Is that possible and/or advisable, or is there a better method?
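For reference, on a modern Django (1.6+ for select_for_update semantics shown here, 1.9+ for on_commit) the row-locking approach could look roughly like this, assuming DiskStorage is the model behind user.profile.diskstorage:

from django.db import transaction

with transaction.atomic():
    # lock the row until COMMIT so concurrent uploads serialize here
    disk = (DiskStorage.objects
            .select_for_update()
            .get(pk=user.profile.diskstorage.pk))
    if (usage_pct >= settings.VDISK_HIGH_USAGE_THRESHOLD
            and disk.last_high_usage_reminder is None):
        disk.last_high_usage_reminder = timezone.now()
        disk.save()
        # send only after the flag is durably committed
        transaction.on_commit(lambda: vdisks_high_usage_send_notice(user))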
OK I'm quietly confident I've solved the problem using this answer: https://stackoverflow.com/a/7794220/698289
Firstly, implement this utility function:
@transaction.commit_manually
def flush_transaction():
    transaction.commit()
And then modify my original code to flush and reload the disk record:
flush_transaction()
disk = user.profile.diskstorage
if usage_pct >= settings.VDISK_HIGH_USAGE_THRESHOLD and disk.last_high_usage_reminder is None:
    disk.last_high_usage_reminder = timezone.now()
    disk.save()
    vdisks_high_usage_send_notice(user)