I've been trying to test this out, but haven't been able to come to a definitive answer. I'm using SQLAlchemy on top of MySQL and trying to prevent having threads that do a select, get a SHARED_READ lock on some table, and then hold on to it (preventing future DDL operations until it's released). This happens when queries aren't committed. I'm using SQLAlchemy Core, where as far as I could tell .execute() essentially works in autocommit mode, issuing a COMMIT after everything it runs unless explicitly told we're in a transaction. Nevertheless, in show processlist, I'm seeing sleeping threads that still have SHARED_READ locks on a table they once queried. What gives?
From your post I assume you're operating in "non-transactional" mode, either using an SQLAlchemy Connection without an ongoing transaction or the shorthand engine.execute(). In this mode SQLAlchemy detects INSERT, UPDATE, DELETE, and DDL statements and automatically issues a commit after them, but it does not do so for everything; plain SELECT statements are not committed. See "Understanding Autocommit". For selects of mutating stored procedures and the like that do require a commit, use
conn.execute(text('SELECT ...').execution_options(autocommit=True))
You should also consider closing connections when the thread is done with them for the time being. Closing calls rollback() on the underlying DBAPI connection, which per PEP 249 is (probably) always in a transactional state. This removes the transactional state and any locks, and returns the connection to the connection pool. This way you shouldn't need to worry about selects not autocommitting.
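As a rough sketch of the "close the connection when you're done" advice, assuming SQLAlchemy 1.4 or later and an in-memory SQLite engine standing in for the MySQL one (the helper name fetch_one is made up for illustration):

```python
from sqlalchemy import create_engine, text

# Assumption: in-memory SQLite replaces the real MySQL URL for this sketch.
engine = create_engine("sqlite://")


def fetch_one(sql):
    """Run a read-only statement and give the connection back right away."""
    # Leaving the "with" block closes the Connection, which rolls back the
    # DBAPI connection and returns it to the pool.
    with engine.connect() as conn:
        value = conn.execute(text(sql)).scalar()
    return value, conn.closed


value, closed = fetch_one("SELECT 41 + 1")
```

Because the rollback happens when the block exits, a sleeping thread should no longer hold on to whatever the SELECT locked.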
I have an SQLAlchemy session in a script. The script is running for a long time, and it only fetches data from database, never updates or inserts.
I get quite a lot of errors like
sqlalchemy.exc.DBAPIError: (TransactionRollbackError) terminating connection due to conflict with recovery
DETAIL: User was holding a relation lock for too long.
The way I understand it, SQLAlchemy creates a transaction with the first select issued, and then reuses it. As my script may run for about an hour, it is very likely that a conflict comes up during the lifetime of that transaction.
To get rid of the error, I could use autocommit in the deprecated mode (without doing anything more), but this is explicitly discouraged by the documentation.
What is the right way to deal with the error? Can I use ORM queries without transactions at all?
I ended up closing the session after (almost) every select, like
session.query(Foo).all()
session.close()
Since I do not use autocommit, a new transaction is opened automatically.
I want to detach an instance of a class from my session but it should still be available for reading (without emitting a query). I have been scanning through the documentation for days now, but every approach I try leads to the message
DetachedInstanceError: Instance <MyModel at 0x36bb190> is not bound to a Session;
attribute refresh operation cannot proceed
I am working with the zope.sqlalchemy transaction manager in Pyramid. I want my object to be usable after the transaction has been committed. I only need to read the "cached" values, i.e. those that were in it before the transaction was committed.
The only possible solution I could find was to wrap the class (or the attributes themselves) and then track the changes manually (I could do that, but it is really ugly and not at all Pythonic).
Is there a way I can prevent SQLAlchemy from trying to refresh these values?
As a fallback I would even be open to just returning None, as long as the above error doesn't get thrown after the transaction has been committed.
You can do exactly this (for example for caching) by doing:
session.expunge(obj)
As per sqlalchemy documentation:
http://docs.sqlalchemy.org/en/rel_1_0/orm/session_api.html?highlight=expire#sqlalchemy.orm.session.Session.expunge
This gives you an object in the detached state that you can safely use. Keep in mind that you will not be able to read attributes that would emit another query and are therefore tied to the session, such as relationships; accessing those ends in a DetachedInstanceError.
By default a commit issues an expire_all(), which means all objects refresh their state on the next read. By expunging them first you detach them from the session, so there should be no subsequent queries after you commit your transaction.
I would advise against disabling this functionality globally as the other comments suggest, since Mike Bayer generally suggests it is a good idea and a sane default for most people that can save you headaches in the long run.
Just expunge things explicitly when you need them.
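Here is a small sketch of that behaviour, assuming SQLAlchemy 1.4 or later and an in-memory SQLite database; the Item model and its rows are invented for the example:

```python
from sqlalchemy import Column, Integer, String, create_engine, inspect
from sqlalchemy.orm import declarative_base, sessionmaker
from sqlalchemy.orm.exc import DetachedInstanceError

Base = declarative_base()


class Item(Base):  # hypothetical model, only for this sketch
    __tablename__ = "items"
    id = Column(Integer, primary_key=True)
    name = Column(String)


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)

session = Session()
session.add_all([Item(name="kept"), Item(name="expired")])
session.commit()

kept = session.query(Item).filter_by(name="kept").one()
expired = session.query(Item).filter_by(name="expired").one()

session.expunge(kept)  # detach while its attributes are still loaded
session.commit()       # expires what is still in the session, not "kept"
session.close()

kept_name = kept.name  # read from the object itself, no query emitted
kept_is_detached = inspect(kept).detached

try:
    expired.name       # expired and detached, so the refresh cannot run
    refresh_failed = False
except DetachedInstanceError:
    refresh_failed = True
```

The expunged object keeps the values it had when it was loaded, while the one left in the session hits the error from the question.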
http://docs.sqlalchemy.org/en/latest/orm/session_api.html
I think you're looking for expire_on_commit = False
I believe this allows you to detach the object and continue to use it. Trying to modify it and commit however will lead to the DetachedInstanceError.
try this:
DBSession = scoped_session(sessionmaker(extension=ZopeTransactionExtension(), expire_on_commit=False))
I had this problem too and using expire_on_commit=False solved my problem.
from contextlib import contextmanager

@contextmanager
def make_session_scope(Session):
    """Provide a transactional scope around a series of operations."""
    session = Session()
    session.expire_on_commit = False
    try:
        yield session
        session.commit()
    except:
        session.rollback()
        raise
    finally:
        session.close()

# Session is the sessionmaker() factory, model is a mapped class
with make_session_scope(Session) as session:
    query = session.query(model)
I am currently using scoped_session provided by sqlalchemy with autocommit=True and autoflush=True.
I notice that autoflush is not called properly, as some of the updated results are not flushed when my script finishes executing.
Is autoflush not meant to be run with scoped_session in a multithreaded environment?
Is autoflush not meant to be run with scoped_session in a multithreaded environment?
there is no such restriction, no.
I notice that autoflush is not called properly, as some of the updated results are not flushed when my script finishes executing.
This is a misunderstanding of autoflush. Autoflush is intended to flush pending data to the database before a query emits a SELECT to the database. It does not provide the feature however that data is flushed immediately as each attribute of an object is changed, as this would be very inefficient and is not feasible with any kind of ORM, unit of work or not. So if you modify a bunch of objects, then throw away the Session without further interaction with it, those pending changes are lost.
Autoflush is intended to be used within the context of a transaction. In its default mode of usage, the Session begins a transaction for you, and you only need call commit() when a series of changes are ready to be finalized. See the docs for background http://docs.sqlalchemy.org/en/rel_0_7/orm/session.html#flushing as well as the strong recommendations to avoid autocommit at http://docs.sqlalchemy.org/en/rel_0_7/orm/session.html#autocommit-mode .
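To make the point concrete, here is a minimal sketch assuming SQLAlchemy 1.4 or later with an in-memory SQLite database; the Note model is hypothetical:

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()


class Note(Base):  # hypothetical model, only for this sketch
    __tablename__ = "notes"
    id = Column(Integer, primary_key=True)
    body = Column(String)


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)  # autoflush is on by default

session = Session()
session.add(Note(body="pending"))

# The query triggers autoflush: the INSERT is sent before the SELECT runs.
seen_in_same_session = session.query(Note).count()

session.close()  # no commit(), so the transaction is rolled back

check = Session()
seen_after_discard = check.query(Note).count()
check.close()
```

The row shows up inside the session that added it, but it is gone once that session is thrown away without a commit().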
What is the difference between flush() and commit() in SQLAlchemy?
I've read the docs, but am none the wiser - they seem to assume a pre-understanding that I don't have.
I'm particularly interested in their impact on memory usage. I'm loading some data into a database from a series of files (around 5 million rows in total) and my session is occasionally falling over - it's a large database and a machine with not much memory.
I'm wondering if I'm using too many commit() and not enough flush() calls - but without really understanding what the difference is, it's hard to tell!
A Session object is basically an ongoing transaction of changes to a database (update, insert, delete). These operations aren't persisted to the database until they are committed (if your program aborts for some reason in mid-session transaction, any uncommitted changes within are lost).
The session object registers transaction operations with session.add(), but doesn't yet communicate them to the database until session.flush() is called.
session.flush() communicates a series of operations to the database (insert, update, delete). The database maintains them as pending operations in a transaction. The changes aren't persisted permanently to disk, or visible to other transactions until the database receives a COMMIT for the current transaction (which is what session.commit() does).
session.commit() commits (persists) those changes to the database.
flush() is always called as part of a call to commit().
When you use a Session object to query the database, the query will return results both from the database and from the flushed parts of the uncommitted transaction it holds. By default, Session objects autoflush their operations, but this can be disabled.
Hopefully this example will make this clearer:
#---
s = Session()

s.add(Foo('A'))  # The Foo('A') object has been added to the session.
                 # It has not been committed to the database yet,
                 # but is returned as part of a query.
print(1, s.query(Foo).all())
s.commit()

#---
s2 = Session()
s2.autoflush = False

s2.add(Foo('B'))
print(2, s2.query(Foo).all())  # The Foo('B') object is *not* returned
                               # as part of this query because it hasn't
                               # been flushed yet.
s2.flush()                     # Now, Foo('B') is in the same state as
                               # Foo('A') was above.
print(3, s2.query(Foo).all())
s2.rollback()                  # Foo('B') has not been committed, and rolling
                               # back the session's transaction removes it
                               # from the session.
print(4, s2.query(Foo).all())
#---
Output:
1 [<Foo('A')>]
2 [<Foo('A')>]
3 [<Foo('A')>, <Foo('B')>]
4 [<Foo('A')>]
This does not strictly answer the original question but some people have mentioned that with session.autoflush = True you don't have to use session.flush()... And this is not always true.
If you want to use the id of a newly created object in the middle of a transaction, you must call session.flush().
# Given a model with at least this id
class AModel(Base):
    id = Column(Integer, primary_key=True)  # autoincrement by default on integer primary key

session.autoflush = True

a = AModel()
session.add(a)
a.id  # None
session.flush()
a.id  # autoincremented integer
This is because autoflush does NOT auto fill the id (although a query of the object will, which sometimes can cause confusion as in "why this works here but not there?" But snapshoe already covered this part).
One related aspect that seems pretty important to me and wasn't really mentioned:
Why would you not commit all the time? - The answer is atomicity.
A fancy word for saying that a set of operations must either all be executed successfully, or none of them takes effect.
For example, you want to create/update/delete some object (A) and then create/update/delete another (B), and if (B) fails you want to revert (A). That means those two operations must be atomic.
Therefore, if (B) needs a result of (A), you want to call flush after (A) and commit after (B).
Also, if session.autoflush is True, except for the case that I mentioned above or others in Jimbo's answer, you will not need to call flush manually.
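A sketch of the "flush after (A), commit after (B)" pattern, assuming SQLAlchemy 1.4 or later and in-memory SQLite. Author and Book are invented stand-ins for (A) and (B), and (B) is made to fail on purpose:

```python
from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.exc import IntegrityError
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()


class Author(Base):  # plays the role of (A)
    __tablename__ = "authors"
    id = Column(Integer, primary_key=True)
    name = Column(String, nullable=False)


class Book(Base):  # plays the role of (B)
    __tablename__ = "books"
    id = Column(Integer, primary_key=True)
    title = Column(String, nullable=False)
    author_id = Column(Integer, ForeignKey("authors.id"), nullable=False)


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)

session = Session()
author = Author(name="A")
session.add(author)
session.flush()  # INSERT is sent, so author.id is populated
id_after_flush = author.id

try:
    # (B) needs the id of (A); the missing title violates NOT NULL.
    session.add(Book(title=None, author_id=author.id))
    session.commit()  # flushes (B), which fails
except IntegrityError:
    session.rollback()  # reverts (A) as well

authors_left = session.query(Author).count()
session.close()
```

Had (A) been committed instead of flushed, the rollback could not have taken it back.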
Use flush when you need to simulate a write, for example to get a primary key ID from an autoincrementing counter.
john=Person(name='John Smith', parent=None)
session.add(john)
session.flush()
son=Person(name='Bill Smith', parent=john.id)
Without flushing, john.id would be null.
Like others have said, without commit() none of this will be permanently persisted to DB.
Why flush if you can commit?
As someone new to working with databases and sqlalchemy, the previous answers - that flush() sends SQL statements to the DB and commit() persists them - were not clear to me. The definitions make sense but it isn't immediately clear from the definitions why you would use a flush instead of just committing.
Since a commit always flushes (https://docs.sqlalchemy.org/en/13/orm/session_basics.html#committing) these sound really similar. I think the big issue to highlight is that a flush is not permanent and can be undone, whereas a commit is permanent, in the sense that you can't ask the database to undo the last commit (I think)
@snapshoe highlights that if you want to query the database and get results that include newly added objects, you need to have flushed first (or committed, which will flush for you). Perhaps this is useful for some people although I'm not sure why you would want to flush rather than commit (other than the trivial answer that it can be undone).
In another example I was syncing documents between a local DB and a remote server, and if the user decided to cancel, all adds/updates/deletes should be undone (i.e. no partial sync, only a full sync). When updating a single document I've decided to simply delete the old row and add the updated version from the remote server. It turns out that due to the way sqlalchemy is written, order of operations when committing is not guaranteed. This resulted in adding a duplicate version (before attempting to delete the old one), which resulted in the DB failing a unique constraint. To get around this I used flush() so that order was maintained, but I could still undo if later the sync process failed.
See my post on this at: Is there any order for add versus delete when committing in sqlalchemy
Similarly, someone wanted to know whether add order is maintained when committing, i.e. if I add object1 then add object2, does object1 get added to the database before object2
Does SQLAlchemy save order when adding objects to session?
Again, here presumably the use of a flush() would ensure the desired behavior. So in summary, one use for flush is to provide order guarantees (I think), again while still allowing yourself an "undo" option that commit does not provide.
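Roughly what the delete-then-add workaround looks like, as a sketch assuming SQLAlchemy 1.4 or later and in-memory SQLite; the Document model and the remote_key column are made up to mimic the sync scenario:

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()


class Document(Base):  # hypothetical model with a unique business key
    __tablename__ = "documents"
    id = Column(Integer, primary_key=True)
    remote_key = Column(String, unique=True, nullable=False)
    body = Column(String)


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)

session = Session()
session.add(Document(remote_key="doc-1", body="old"))
session.commit()

old = session.query(Document).filter_by(remote_key="doc-1").one()
session.delete(old)
session.flush()  # the DELETE is emitted now, before the INSERT below
session.add(Document(remote_key="doc-1", body="new"))
session.flush()  # the unique key is free again, so the INSERT goes through

body_during_sync = (
    session.query(Document.body).filter_by(remote_key="doc-1").scalar()
)

session.rollback()  # the user cancelled: both steps are undone
body_after_cancel = (
    session.query(Document.body).filter_by(remote_key="doc-1").scalar()
)
session.close()
```

The explicit flush() calls pin down the order of the statements, and since nothing was committed the whole sync can still be rolled back.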
Autoflush and Autocommit
Note, autoflush can be used to ensure queries act on an updated database as sqlalchemy will flush before executing the query. https://docs.sqlalchemy.org/en/13/orm/session_api.html#sqlalchemy.orm.session.Session.params.autoflush
Autocommit is something else that I don't completely understand but it sounds like its use is discouraged:
https://docs.sqlalchemy.org/en/13/orm/session_api.html#sqlalchemy.orm.session.Session.params.autocommit
Memory Usage
Now the original question actually wanted to know about the impact of flush vs. commit for memory purposes. As the ability to persist or not is something the database offers (I think), simply flushing should be sufficient to offload to the database - although committing shouldn't hurt (actually probably helps - see below) if you don't care about undoing.
sqlalchemy uses weak referencing for objects that have been flushed: https://docs.sqlalchemy.org/en/13/orm/session_state_management.html#session-referencing-behavior
This means if you don't have an object explicitly held onto somewhere, like in a list or dict, sqlalchemy won't keep it in memory.
However, then you have the database side of things to worry about. Presumably flushing without committing comes with some memory penalty to maintain the transaction. Again, I'm new to this but here's a link that seems to suggest exactly this: https://stackoverflow.com/a/15305650/764365
In other words, commits should reduce memory usage, although presumably there is a trade-off between memory and performance here: you probably don't want to commit every single database change one at a time (for performance reasons), but waiting too long will increase memory usage.
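One way to strike that balance is to commit once per batch. This is only a sketch, assuming SQLAlchemy 1.4 or later and in-memory SQLite; the Row model, the load helper, and the batch size of 1000 are placeholders to tune for your own data:

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()


class Row(Base):  # hypothetical model, only for this sketch
    __tablename__ = "rows"
    id = Column(Integer, primary_key=True)
    payload = Column(String)


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)

BATCH_SIZE = 1000  # assumed value; tune against memory and speed


def load(session, payloads, batch_size=BATCH_SIZE):
    """Add rows one by one, committing once per full batch."""
    commits = 0
    for count, payload in enumerate(payloads, start=1):
        session.add(Row(payload=payload))
        if count % batch_size == 0:
            session.commit()  # ends the transaction for this batch
            commits += 1
    session.commit()  # commit the final, possibly partial batch
    return commits + 1


session = Session()
commit_calls = load(session, (f"line {n}" for n in range(2500)))
total_rows = session.query(Row).count()
session.close()
```

Feeding the rows from a generator also avoids holding the whole input in a Python list while loading.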
The existing answers don't make a lot of sense unless you understand what a database transaction is. (Twas the case for myself until recently.)
Sometimes you might want to run multiple SQL statements and have them succeed or fail as a whole. For example, if you want to execute a bank transfer from account A to account B, you'll need to do two queries like
UPDATE accounts SET value = value - 100 WHERE acct = 'A'
UPDATE accounts SET value = value + 100 WHERE acct = 'B'
If the first query succeeds but the second fails, this is bad (for obvious reasons). So, we need a way to treat these two queries "as a whole". The solution is to start with a BEGIN statement and end with either a COMMIT statement or a ROLLBACK statement, like
BEGIN
UPDATE accounts SET value = value - 100 WHERE acct = 'A'
UPDATE accounts SET value = value + 100 WHERE acct = 'B'
COMMIT
This is a single transaction.
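The same all-or-nothing behaviour can be tried out with nothing but Python's built-in sqlite3 module. This is just a sketch: the CHECK constraint and the transfer helper are additions for the demonstration, not part of the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " acct TEXT PRIMARY KEY,"
    " value INTEGER CHECK (value >= 0))"  # assumption: no overdrafts allowed
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 150), ("B", 0)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money in one transaction; return False if it was rolled back."""
    try:
        with conn:  # COMMIT on success, ROLLBACK if an exception escapes
            conn.execute(
                "UPDATE accounts SET value = value + ? WHERE acct = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET value = value - ? WHERE acct = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


first_ok = transfer(conn, "A", "B", 100)   # enough money, commits
second_ok = transfer(conn, "A", "B", 100)  # debit fails, credit is undone
balances = dict(conn.execute("SELECT acct, value FROM accounts"))
conn.close()
```

In the second call the credit to B has already been executed when the debit fails, yet B's balance is unchanged afterwards because the whole transaction is rolled back.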
In SQLAlchemy's ORM, this might look like
# BEGIN issued here
acctA = session.query(Account).get(1) # SELECT issued here
acctB = session.query(Account).get(2) # SELECT issued here
acctA.value -= 100
acctB.value += 100
session.commit() # UPDATEs and COMMIT issued here
If you monitor when the various queries get executed, you'll see the UPDATEs don't hit the database until you call session.commit().
In some situations you might want to execute the UPDATE statements before issuing a COMMIT. (Perhaps the database issues an auto-incrementing id to the object and you want to fetch it before COMMITing). In these cases, you can explicitly flush() the session.
# BEGIN issued here
acctA = session.query(Account).get(1) # SELECT issued here
acctB = session.query(Account).get(2) # SELECT issued here
acctA.value -= 100
acctB.value += 100
session.flush() # UPDATEs issued here
session.commit() # COMMIT issued here
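If you want to watch this happen yourself, one option is to record every statement with an engine event. The sketch below assumes SQLAlchemy 1.4 or later and in-memory SQLite; the Account model, the statements list, and the helper names are made up for the example:

```python
from sqlalchemy import Column, Integer, create_engine, event
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()


class Account(Base):  # hypothetical model matching the example above
    __tablename__ = "accounts"
    id = Column(Integer, primary_key=True)
    value = Column(Integer, nullable=False)


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)

statements = []  # every SQL string that goes through a DBAPI cursor


@event.listens_for(engine, "before_cursor_execute")
def record(conn, cursor, statement, parameters, context, executemany):
    statements.append(statement)


def update_count():
    return sum(s.lstrip().upper().startswith("UPDATE") for s in statements)


session = Session()
session.add_all([Account(id=1, value=500), Account(id=2, value=0)])
session.commit()

acct_a = session.get(Account, 1)
acct_b = session.get(Account, 2)
acct_a.value -= 100
acct_b.value += 100
updates_before_flush = update_count()  # nothing has been sent yet

session.flush()
updates_after_flush = update_count()   # the UPDATE has now been emitted

session.commit()  # nothing left to flush, only the COMMIT remains
updates_after_commit = update_count()
session.close()
```

The COMMIT itself does not show up in the list, because it goes through the DBAPI connection rather than a cursor.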
commit() persists these changes to the database. flush() is always called as part of a call to commit(). When you use a Session object to query the database, the query returns results from both the database and the flushed parts of the uncommitted transaction it holds.
For simple orientation:
commit makes real changes (they become visible in the database)
flush makes provisional changes (they become visible just for you)
Imagine that databases work like git-branching.
First you have to understand that during a transaction you are not manipulating the real database data.
Instead, you get something like a new branch, and there you play around.
If at some point you write the command commit, that means: "merge my data-changes into main DB data".
But if you need some future data that you can only get after a commit (e.g. you insert into a table and need the inserted PKID), then you use the flush command, meaning: "calculate the future PKID for me, and reserve it for me".
Then you can use that PKID value further in your code and be sure that the real data will be as expected.
Commit must always come at the end, to merge into main DB data.
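The "branch" picture can be checked at the database level with two connections. This is a sketch using only Python's sqlite3 module and a temporary file (so no SQLAlchemy involved); the table and variable names are made up:

```python
import os
import sqlite3
import tempfile


def count(conn):
    return conn.execute("SELECT COUNT(*) FROM items").fetchall()[0][0]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.db")
    writer = sqlite3.connect(path)  # "my branch"
    reader = sqlite3.connect(path)  # everybody else

    writer.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
    writer.commit()

    # The INSERT opens a transaction on the writer; nothing is committed yet.
    cursor = writer.execute("INSERT INTO items (name) VALUES (?)", ("draft",))
    reserved_id = cursor.lastrowid  # the key is already known to the writer

    writer_sees_before = count(writer)  # own uncommitted row is visible
    reader_sees_before = count(reader)  # other connections do not see it

    writer.commit()  # "merge into main"
    reader_sees_after = count(reader)

    writer.close()
    reader.close()
```

Sending the INSERT without committing is what flush() does from the database's point of view: the writer already has its key and sees its own row, while the second connection only sees it after the commit.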