Detaching SQLAlchemy instance so no refresh happens - python

I want to detach an instance of a class from my session but it should still be available for reading (without emitting a query). I have been scanning through the documentation for days now, but every approach I try leads to the message
DetachedInstanceError: Instance <MyModel at 0x36bb190> is not bound to a Session;
attribute refresh operation cannot proceed
I am working with the zope.sqlalchemy transaction manager in Pyramid. I want my object to be usable after the transaction has been committed. I only need to read the "cached" values, i.e. those that were in it before the transaction was committed.
The only possible solution I could find was to wrap the class (or the attributes itself) and then track the changes manually (I could do that but it is really ugly and not at all Pythonic).
Is there a way I can prevent SQLAlchemy from trying to refresh these values?
As a fallback I would even be open to just returning None, as long as the above error doesn't get thrown after the transaction has been committed.

You can do exactly this (for example for caching) by doing:
session.expunge(obj)
As per sqlalchemy documentation:
http://docs.sqlalchemy.org/en/rel_1_0/orm/session_api.html?highlight=expire#sqlalchemy.orm.session.Session.expunge
This will give you an object in the detached state that you can safely use. Keep in mind that you will not be able to read attributes that would emit another query, and thus require a session, such as relationships; those will still raise DetachedInstanceError.
By default a commit issues expire_all(), which means all objects refresh their state on the next read. By expunging them first, you detach them from the session, so no subsequent queries are emitted after you commit your transaction.
I would advise against disabling this behavior globally as the other answers suggest: Mike Bayer generally considers it a good idea and a sane default for most people, one that can save you headaches in the long run.
Just expunge things explicitly when you need them.
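A minimal sketch of the pattern (MyModel and its name column stand in for the asker's model):
obj = session.query(MyModel).first()
session.expunge(obj)   # detach: the session stops tracking and expiring obj
session.commit()       # expire_all() runs for session-bound objects, but not obj
print(obj.name)        # reads the already-loaded value; no query, no error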

http://docs.sqlalchemy.org/en/latest/orm/session_api.html
I think you're looking for expire_on_commit = False
I believe this allows you to detach the object and continue to use it. Trying to modify it and commit however will lead to the DetachedInstanceError.
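A minimal sketch, assuming an existing engine and a mapped class MyModel:
from sqlalchemy.orm import sessionmaker

Session = sessionmaker(bind=engine, expire_on_commit=False)
session = Session()
obj = session.query(MyModel).first()
session.commit()   # attributes are not expired, so obj stays readable
print(obj.name)    # no refresh query is emitted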

try this:
DBSession = scoped_session(sessionmaker(extension=ZopeTransactionExtension(), expire_on_commit=False))
I had this problem too and using expire_on_commit=False solved my problem.

from contextlib import contextmanager

@contextmanager
def make_session_scope(Session):
    """Provide a transactional scope around a series of operations."""
    session = Session()
    session.expire_on_commit = False
    try:
        yield session
        session.commit()
    except:
        session.rollback()
        raise
    finally:
        session.close()

with make_session_scope(Session) as session:
    query = session.query(model)

Related

What is the use case for Django's on_commit?

Reading this documentation https://docs.djangoproject.com/en/4.0/topics/db/transactions/#django.db.transaction.on_commit
This is the use case for on_commit
with transaction.atomic():  # Outer atomic, start a new transaction
    transaction.on_commit(foo)
    # Do things...
    with transaction.atomic():  # Inner atomic block, create a savepoint
        transaction.on_commit(bar)
        # Do more things...
# foo() and then bar() will be called when leaving the outermost block
But why not just write the code like normal without on_commit hooks? Like this:
with transaction.atomic():  # Outer atomic, start a new transaction
    # Do things...
    with transaction.atomic():  # Inner atomic block, create a savepoint
        # Do more things...
foo()
bar()
# foo() and then bar() are called after leaving the outermost block
It's easier to read since it doesn't require more knowledge of the Django APIs and the statements are put in the order of when they are executed. It's easier to test since you don't have to use any special test classes for Django.
So what is the use-case for the on_commit hook?
The example code given in the Django docs is transaction.on_commit(lambda: some_celery_task.delay('arg1')) and it's probably specifically because this comes up a lot with celery tasks.
Imagine if you do the following within a transaction:
my_object = MyObject.objects.create()
some_celery_task.delay(my_object.pk)
Then in your celery task you try doing this:
@app.task
def some_celery_task(object_pk):
    my_object = MyObject.objects.get(pk=object_pk)
This may work a lot of the time, but randomly you'll get errors where it's not able to find the object (depending on how quickly the worker task runs, because it's a race condition). This is because you created a MyObject record within a transaction, but it isn't actually available in the database until a COMMIT is run. Celery has no access to that open transaction, so it needs to run after the COMMIT. There's also the very real possibility that something later on causes a ROLLBACK, in which case the celery task should never be called at all.
So... You need to do:
my_object = MyObject.objects.create()
transaction.on_commit(lambda: some_celery_task.delay(my_object.pk))
Now, the celery task won't be called until the MyObject has actually been saved to the database after the COMMIT was called.
I should note, though, this is primarily only a concern when you aren't using AUTOCOMMIT (which is actually the default). If you're in AUTOCOMMIT mode then you can be certain that a commit has finished as part of a .create() or .save(). However, if your code base has any possibility of being called within a @transaction.atomic() then it's no longer AUTOCOMMIT and you're back to needing .on_commit(), so it's best/safest to always use it.
Django documentation:
Django provides the on_commit() function to register callback functions that should be executed after a transaction is successfully committed
It is the main purpose. A transaction is a unit of work that you want to treat atomically. It either happens completely or not at all. The same applies to your code. If something went wrong during DB operations you might not need to do some things.
Let's consider some business logic flow:
User sends his registration data to our endpoint, we validate it, etc.
We save the new user to our DB.
We send a "hello" email with a link for confirming the account.
If something goes wrong during step 2, we shouldn't go to step 3.
You might think: well, I'll get an exception anyway and that code won't execute. Why do we still need on_commit()?
Sometimes you take actions in your code based on an assumption that the transaction will succeed, before potentially dangerous DB operations. For example, you might want to first check whether you can send an email to your user, because you know that your third-party email provider often returns a 500. In that case, you want to raise a 500 for the user and ask them to register later (a very bad idea, btw, but it's just a synthetic example).
When your function (e.g. with the @atomic decorator) contains a lot of DB operations, you surely don't want to track all the intermediate state manually in order to act on it after all the DB-related code. Consider this flow:
Validation of user's order.
Checking at DB if it could be completed.
If it could be done we need to send a request to 3rd-party CRM with the order's details.
If it couldn't, then we should create a support ticket in another 3rd-party.
Saving user's order to DB, updating user's model.
Sending a messenger notification to the employee who is responsible for the order.
Saving information, that notification for employee was sent successfully to the DB.
You can imagine what a mess we would have without on_commit in this situation: one really big try-catch around all of it.
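A sketch of the registration flow from earlier; User is assumed to be a Django model and send_welcome_email a hypothetical helper:
from django.db import transaction

def register_user(validated_data):
    with transaction.atomic():
        user = User.objects.create(**validated_data)
        # The callback fires only after a successful COMMIT;
        # it is dropped if the transaction rolls back.
        transaction.on_commit(lambda: send_welcome_email(user.pk))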

SQLAlchemy + pyTelegramBotAPI: SQLite objects created in a thread can only be used in that same thread

I have a real headache trying to understand the cause of the following problem. We are using a combination of the following libraries:
pyTelegramBotAPI to process requests in a multi-threaded way
SQLAlchemy
sqlite
SQLAlchemy was first using NullPool and is now configured to use QueuePool. I am also using the following idiom to have a new DB session fired up for each thread (as I understand it):
from contextlib import contextmanager

Session = sessionmaker(bind=create_engine(classes.db_url, poolclass=QueuePool))

@contextmanager
def session_scope():
    session = Session()
    try:
        yield session
        session.commit()
    except:
        session.rollback()
        raise
    finally:
        session.close()

@bot.message_handler(content_types=['document'])
def method_handler(message):
    with session_scope() as session:
        do_database_stuff_here(session)
Nevertheless I am still getting this annoying exception: (sqlite3.ProgrammingError) SQLite objects created in a thread can only be used in that same thread
Any ideas? ;) In particular, I don't get how it is possible for another thread to get in between the DB operations... that is probably the cause of the pesky exception.
Update 1: if I change the poolclass to SingletonThreadPool, there seem to be no more errors. However, the SQLAlchemy documentation says that SingletonThreadPool is not production-ready.
As you can see in the source, sqlite will raise this exception inside pysqlite_check_thread if the connection object is reused across any threads.
By using a QueuePool, you are telling SQLAlchemy it is safe to reuse connections across multiple threads. It will therefore just pick a connection from the pool for any session, no matter which thread it is on. This is why you're hitting the error. The first time you create and use a connection, you'll be fine; however, the next use will probably be on a different thread and so fail the check.
This is why SQLAlchemy mandates the use of other pools such as SingletonThreadPool and NullPool.
Assuming you are using a file-based database, you should use the NullPool. This will give you good concurrency on reads. Write concurrency is always going to be an issue for sqlite; if you need that, you probably want a different database.
Something that may be worth trying: use scoped_session instead of your contextmanager; scoped_session implicitly creates a thread-local session when it is accessed from a different thread. Be sure also to use NullPool.
from sqlalchemy.orm import scoped_session, sessionmaker

session_factory = sessionmaker(bind=create_engine(classes.db_url, poolclass=NullPool))
session = scoped_session(session_factory)
Note that you can use this scoped session directly as if it were just a regular session, even though it is actually creating thread-local sessions behind the scenes when it is being used.
With scoped_session, you should call session.remove() after you're done (i.e., at the end of each method_handler) and explicitly call session.commit() as needed.
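A sketch of how that might look in the asker's handler (do_database_stuff_here is the asker's placeholder):
@bot.message_handler(content_types=['document'])
def method_handler(message):
    try:
        do_database_stuff_here(session)  # scoped_session resolves to this thread's own Session
        session.commit()
    finally:
        session.remove()                 # discard the thread-local session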
In theory, your context manager should work in giving each thread its own session, but, for lack of a better explanation, I wonder if multiple threads are accessing that session within the context.

SQLAlchemy long running script: User was holding a relation lock for too long

I have an SQLAlchemy session in a script. The script is running for a long time, and it only fetches data from database, never updates or inserts.
I get quite a lot of errors like
sqlalchemy.exc.DBAPIError: (TransactionRollbackError) terminating connection due to conflict with recovery
DETAIL: User was holding a relation lock for too long.
The way I understand it, SQLAlchemy creates a transaction with the first select issued, and then reuses it. As my script may run for about an hour, it is very likely that a conflict comes up during the lifetime of that transaction.
To get rid of the error, I could use autocommit in the deprecated mode (without doing anything more), but this is explicitly discouraged by the documentation.
What is the right way to deal with the error? Can I use ORM queries without transactions at all?
I ended up closing the session after (almost) every select, like this:
session.query(Foo).all()
session.close()
Since I do not use autocommit, a new transaction is automatically opened on the next query.
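In a long-running read-only script, that pattern might look like the following sketch (process and the sleep interval are placeholders):
import time

while True:
    rows = session.query(Foo).all()  # implicitly begins a new transaction
    process(rows)
    session.close()                  # ends the transaction, releasing its locks
    time.sleep(60)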

SQLAlchemy - Multithreaded Persistent Object Creation, how to merge back into single session to avoid state conflict?

I have tens (potentially hundreds) of thousands of persistent objects that I want to generate in a multithreaded fashion, due to the processing required.
While the creation of the objects happens in separate threads (using Flask-SQLAlchemy extension btw with scoped sessions) the call to write the generated objects to the DB happens in 1 place after the generation has completed.
The problem, I believe, is that the objects being created are part of several existing relationships, thereby triggering the automatic addition to the identity map despite being created in separate, concurrent threads with no explicit session in any of them.
I was hoping to contain the generated objects in a single list, and then write the whole list (using a single session object) to the database. This results in an error like this:
AssertionError: A conflicting state is already present in the identity map for key (<class 'app.ModelObject'>, (1L,))
Hence why I believe the identity map has already been populated, because it's when I try to add and commit using the global session outside of the concurrent code, the assertion error is triggered.
The final detail is that whatever session object(s) are involved (scoped or otherwise; I don't fully understand how automatic addition to the identity map works in the multithreaded case), I cannot find a way to get a reference to them, so even if I wanted to use a separate session per thread I couldn't.
Any advice is greatly appreciated. The only reason I am not posting code (yet) is because it's difficult to abstract a working example immediately out of my app. I will post if somebody really needs to see it though.
Each session is thread-local; in other words there is a separate session for each thread. If you decide to pass some instances to another thread, they will become "detached" from the session. Use db.session.add_all(objects) in the receiving thread to put them all back.
For some reason, it looks like you're creating objects with the same identity (primary key columns) in different threads, then trying to send them both to the database. One option is to fix why this is happening, so that identities will be guaranteed unique. You may also try merging; merged_object = db.session.merge(other_object, load=False).
Edit: zzzeek's comment clued me in on something else that may be going on:
With Flask-SQLAlchemy, the session is tied to the app context. Since that is thread local, spawning a new thread will invalidate the context; there will be no database session in the threads. All the instances are detached there, and cannot properly track relationships. One solution is to pass app to each thread and perform everything within a with app.app_context(): block. Inside the block, first use db.session.add to populate the local session with the passed instances. You should still merge in the master task afterwards to ensure consistency.
I just want to clarify the problem and the solution with some pseudo-code in case somebody has this problem / wants to do this in the future.
class ObjA(Base):
    obj_c = relationship('ObjC', backref='obj_a')

class ObjB(Base):
    obj_c = relationship('ObjC', backref='obj_b')

class ObjC(Base):
    obj_a_id = Column(Integer, ForeignKey('obj_a.id'))
    obj_b_id = Column(Integer, ForeignKey('obj_b.id'))

    def __init__(self, obj_a, obj_b):
        self.obj_a = obj_a
        self.obj_b = obj_b

def make_a_bunch_of_c(obj_a, list_of_b=None):
    return [ObjC(obj_a, obj_b) for obj_b in list_of_b]

def parallel_generate():
    list_of_a = session.query(ObjA).all()  # assume there are 1000 of these
    list_of_b = session.query(ObjB).all()  # and 30 of these
    fxn = functools.partial(make_a_bunch_of_c, list_of_b=list_of_b)
    pool = multiprocessing.Pool(10)
    all_the_things = pool.map(fxn, list_of_a)
    return all_the_things
Now let's stop here a second. The original problem was that attempting to ADD the list of ObjC's caused the error message in the original question:
session.add_all(all_the_things)
AssertionError: A conflicting state is already present in the identity map for key [...]
Note: the error occurs during the adding phase; the commit attempt never even happens, because the assertion fires pre-commit, as far as I could tell.
Solution:
all_the_things = parallel_generate()
for thing in all_the_things:
    session.merge(thing)
session.commit()
The details of session utilization when dealing with automatically added objects (via relationship cascades) are still beyond me, and I cannot explain why the conflict originally occurred. All I know is that using the merge function causes SQLAlchemy to sort all of the child objects that were created across 10 different processes into a single session in the master process.
I would be curious about the why, if anyone happens across this.

SQLAlchemy: What's the difference between flush() and commit()?

What is the difference between flush() and commit() in SQLAlchemy?
I've read the docs, but am none the wiser - they seem to assume a pre-understanding that I don't have.
I'm particularly interested in their impact on memory usage. I'm loading some data into a database from a series of files (around 5 million rows in total) and my session is occasionally falling over - it's a large database and a machine with not much memory.
I'm wondering if I'm using too many commit() and not enough flush() calls - but without really understanding what the difference is, it's hard to tell!
A Session object is basically an ongoing transaction of changes to a database (update, insert, delete). These operations aren't persisted to the database until they are committed (if your program aborts for some reason in mid-session transaction, any uncommitted changes within are lost).
The session object registers transaction operations with session.add(), but doesn't yet communicate them to the database until session.flush() is called.
session.flush() communicates a series of operations to the database (insert, update, delete). The database maintains them as pending operations in a transaction. The changes aren't persisted permanently to disk, or visible to other transactions until the database receives a COMMIT for the current transaction (which is what session.commit() does).
session.commit() commits (persists) those changes to the database.
flush() is always called as part of a call to commit() (1).
When you use a Session object to query the database, the query will return results both from the database and from the flushed parts of the uncommitted transaction it holds. By default, Session objects autoflush their operations, but this can be disabled.
Hopefully this example will make this clearer:
#---
s = Session()

s.add(Foo('A'))  # The Foo('A') object has been added to the session.
                 # It has not been committed to the database yet,
                 # but is returned as part of a query.
print(1, s.query(Foo).all())

s.commit()

#---
s2 = Session()
s2.autoflush = False

s2.add(Foo('B'))
print(2, s2.query(Foo).all())  # The Foo('B') object is *not* returned
                               # as part of this query because it hasn't
                               # been flushed yet.

s2.flush()                     # Now, Foo('B') is in the same state as
                               # Foo('A') was above.
print(3, s2.query(Foo).all())

s2.rollback()                  # Foo('B') has not been committed, and rolling
                               # back the session's transaction removes it
                               # from the session.
print(4, s2.query(Foo).all())
#---
Output:
1 [<Foo('A')>]
2 [<Foo('A')>]
3 [<Foo('A')>, <Foo('B')>]
4 [<Foo('A')>]
This does not strictly answer the original question but some people have mentioned that with session.autoflush = True you don't have to use session.flush()... And this is not always true.
If you want to use the id of a newly created object in the middle of a transaction, you must call session.flush().
# Given a model with at least this id
class AModel(Base):
    id = Column(Integer, primary_key=True)  # autoincrement by default on integer primary key

session.autoflush = True

a = AModel()
session.add(a)
a.id              # None
session.flush()
a.id              # autoincremented integer
This is because autoflush does NOT automatically fill in the id (although querying for the object will, which can sometimes cause confusion, as in "why does this work here but not there?" But snapshoe already covered this part).
One related aspect that seems pretty important to me and wasn't really mentioned:
Why would you not commit all the time? - The answer is atomicity.
A fancy word meaning: a set of operations must all execute successfully, or none of them takes effect.
For example, if you want to create/update/delete some object (A) and then create/update/delete another (B), but if (B) fails you want to revert (A). This means those 2 operations are atomic.
Therefore, if (B) needs a result of (A), you want to call flush after (A) and commit after (B).
Also, if session.autoflush is True, you will not need to call flush manually, except for the case I mentioned above or others in Jimbo's answer.
Use flush when you need to simulate a write, for example to get a primary key ID from an autoincrementing counter.
john = Person(name='John Smith', parent=None)
session.add(john)
session.flush()

son = Person(name='Bill Smith', parent=john.id)
Without flushing, john.id would be None.
Like others have said, without commit() none of this will be permanently persisted to DB.
Why flush if you can commit?
As someone new to working with databases and sqlalchemy, the previous answers - that flush() sends SQL statements to the DB and commit() persists them - were not clear to me. The definitions make sense but it isn't immediately clear from the definitions why you would use a flush instead of just committing.
Since a commit always flushes (https://docs.sqlalchemy.org/en/13/orm/session_basics.html#committing), these sound really similar. I think the big thing to highlight is that a flush is not permanent and can be undone, whereas a commit is permanent, in the sense that you can't ask the database to undo the last commit (I think).
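A minimal sketch of that difference, using the Foo model from the answer above:
s = Session()
s.add(Foo('temp'))
s.flush()      # INSERT is sent; the row is visible inside this transaction
s.rollback()   # the flushed INSERT is undone and never becomes permanent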
@snapshoe highlights that if you want to query the database and get results that include newly added objects, you need to have flushed first (or committed, which flushes for you). Perhaps this is useful for some people, although I'm not sure why you would want to flush rather than commit (other than the trivial answer that a flush can be undone).
In another example I was syncing documents between a local DB and a remote server, and if the user decided to cancel, all adds/updates/deletes should be undone (i.e. no partial sync, only a full sync). When updating a single document I've decided to simply delete the old row and add the updated version from the remote server. It turns out that due to the way sqlalchemy is written, order of operations when committing is not guaranteed. This resulted in adding a duplicate version (before attempting to delete the old one), which resulted in the DB failing a unique constraint. To get around this I used flush() so that order was maintained, but I could still undo if later the sync process failed.
See my post on this at: Is there any order for add versus delete when committing in sqlalchemy
Similarly, someone wanted to know whether add order is maintained when committing, i.e. if I add object1 and then add object2, does object1 get added to the database before object2?
Does SQLAlchemy save order when adding objects to session?
Again, here presumably the use of a flush() would ensure the desired behavior. So in summary, one use for flush is to provide order guarantees (I think), again while still allowing yourself an "undo" option that commit does not provide.
Autoflush and Autocommit
Note, autoflush can be used to ensure queries act on an updated database as sqlalchemy will flush before executing the query. https://docs.sqlalchemy.org/en/13/orm/session_api.html#sqlalchemy.orm.session.Session.params.autoflush
Autocommit is something else that I don't completely understand but it sounds like its use is discouraged:
https://docs.sqlalchemy.org/en/13/orm/session_api.html#sqlalchemy.orm.session.Session.params.autocommit
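A small sketch of autoflush (again using Foo, and assuming it has a name column):
s = Session()  # autoflush=True is the default
s.add(Foo('C'))
# The pending INSERT is flushed automatically just before this query runs,
# so the new row is visible within the current transaction.
print(s.query(Foo).filter_by(name='C').all())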
Memory Usage
Now the original question actually wanted to know about the impact of flush vs. commit for memory purposes. As the ability to persist or not is something the database offers (I think), simply flushing should be sufficient to offload to the database - although committing shouldn't hurt (actually probably helps - see below) if you don't care about undoing.
sqlalchemy uses weak referencing for objects that have been flushed: https://docs.sqlalchemy.org/en/13/orm/session_state_management.html#session-referencing-behavior
This means if you don't have an object explicitly held onto somewhere, like in a list or dict, sqlalchemy won't keep it in memory.
However, then you have the database side of things to worry about. Presumably flushing without committing comes with some memory penalty to maintain the transaction. Again, I'm new to this but here's a link that seems to suggest exactly this: https://stackoverflow.com/a/15305650/764365
In other words, commits should reduce memory usage, although presumably there is a trade-off between memory and performance here. Put differently, you probably don't want to commit every single database change one at a time (for performance reasons), but waiting too long will increase memory usage.
The existing answers don't make a lot of sense unless you understand what a database transaction is. (That was the case for me until recently.)
Sometimes you might want to run multiple SQL statements and have them succeed or fail as a whole. For example, if you want to execute a bank transfer from account A to account B, you'll need to do two queries like
UPDATE accounts SET value = value - 100 WHERE acct = 'A'
UPDATE accounts SET value = value + 100 WHERE acct = 'B'
If the first query succeeds but the second fails, this is bad (for obvious reasons). So, we need a way to treat these two queries "as a whole". The solution is to start with a BEGIN statement and end with either a COMMIT statement or a ROLLBACK statement, like
BEGIN
UPDATE accounts SET value = value - 100 WHERE acct = 'A'
UPDATE accounts SET value = value + 100 WHERE acct = 'B'
COMMIT
This is a single transaction.
In SQLAlchemy's ORM, this might look like
# BEGIN issued here
acctA = session.query(Account).get(1) # SELECT issued here
acctB = session.query(Account).get(2) # SELECT issued here
acctA.value -= 100
acctB.value += 100
session.commit() # UPDATEs and COMMIT issued here
If you monitor when the various queries get executed, you'll see the UPDATEs don't hit the database until you call session.commit().
In some situations you might want to execute the UPDATE statements before issuing a COMMIT. (Perhaps the database issues an auto-incrementing id to the object and you want to fetch it before COMMITing). In these cases, you can explicitly flush() the session.
# BEGIN issued here
acctA = session.query(Account).get(1) # SELECT issued here
acctB = session.query(Account).get(2) # SELECT issued here
acctA.value -= 100
acctB.value += 100
session.flush() # UPDATEs issued here
session.commit() # COMMIT issued here
commit() records these changes in the database. flush() is always called as part of the commit() call (1). When you use a Session object to query a database, the query returns results both from the database and from the flushed parts of the uncommitted transaction it holds.
For simple orientation:
commit makes real changes (they become visible in the database)
flush makes provisional changes (they become visible only to you, within your transaction)
Imagine that databases work like git-branching.
First you have to understand that during a transaction you are not manipulating the real database data.
Instead, you get something like a new branch, and there you play around.
If at some point you write the command commit, that means: "merge my data-changes into main DB data".
But if you need some data that only becomes available after a commit (e.g. you insert into a table and need the generated primary key), then you use the flush command, meaning: "calculate the future primary key and reserve it for me".
Then you can use that primary key value further in your code and be sure that the real data will be as expected.
Commit must always come at the end, to merge into main DB data.
