Recently I came across strange behavior in SQLAlchemy regarding refreshing/populating model instances with changes that were made outside of the current session. I created the following minimal working example and was able to reproduce the problem with it.
from time import sleep
from sqlalchemy import orm, create_engine, Column, BigInteger, Integer
from sqlalchemy.ext.declarative import declarative_base
DATABASE_URI = "postgresql://{user}:{password}#{host}:{port}/{name}".format(
user="postgres",
password="postgres",
host="127.0.0.1",
name="so_sqlalchemy",
port="5432",
)
class SQLAlchemy:
def __init__(self, db_url, autocommit=False, autoflush=True):
self.engine = create_engine(db_url)
self.session = None
self.autocommit = autocommit
self.autoflush = autoflush
def connect(self):
session_maker = orm.sessionmaker(
bind=self.engine,
autocommit=self.autocommit,
autoflush=self.autoflush,
expire_on_commit=True
)
self.session = orm.scoped_session(session_maker)
def disconnect(self):
self.session.flush()
self.session.close()
self.session.remove()
self.session = None
BaseModel = declarative_base()
class TestModel(BaseModel):
__tablename__ = "test_models"
id = Column(BigInteger, primary_key=True, nullable=False)
field = Column(Integer, nullable=False)
def loop(db):
while True:
with db.session.begin():
t = db.session.query(TestModel).with_for_update().get(1)
if t is None:
print("No entry in db, creating...")
t = TestModel(id=1, field=0)
db.session.add(t)
db.session.flush()
print(f"t.field value is {t.field}")
t.field += 1
print(f"t.field value before flush is {t.field}")
db.session.flush()
print(f"t.field value after flush is {t.field}")
print(f"t.field value after transaction is {t.field}")
print("Sleeping for 2 seconds.")
sleep(2.0)
def main():
db = SQLAlchemy(DATABASE_URI, autocommit=True, autoflush=True)
db.connect()
try:
loop(db)
except KeyboardInterrupt:
print("Canceled")
if __name__ == '__main__':
main()
My requirements.txt file looks like this:
alembic==1.0.10
psycopg2-binary==2.8.2
sqlalchemy==1.3.3
If I run the script (I use Python 3.7.3 on my laptop running Ubuntu 16.04), it will nicely increment a value every two seconds as expected:
t.field value is 0
t.field value before flush is 1
t.field value after flush is 1
t.field value after transaction is 1
Sleeping for 2 seconds.
t.field value is 1
t.field value before flush is 2
t.field value after flush is 2
t.field value after transaction is 2
Sleeping for 2 seconds.
...
Now at some point I open a postgres database shell and begin another transaction:
so_sqlalchemy=# BEGIN;
BEGIN
so_sqlalchemy=# UPDATE test_models SET field=100 WHERE id=1;
UPDATE 1
so_sqlalchemy=# COMMIT;
COMMIT
As soon as I press Enter after the UPDATE query, the script blocks as expected, since I'm issuing a SELECT ... FOR UPDATE query there. However, when I commit the transaction in the database shell, the script continues from the previous value (say, 27) and does not detect that the external transaction has changed the value of field in the database to 100.
My question is, why does this happen at all? There are several factors that seem to contradict the current behavior:
I'm using expire_on_commit setting set to True, which seems to imply that every model instance that has been used in transaction will be marked as expired after the transaction has been committed. (Quoting documentation, "When True, all instances will be fully expired after each commit(), so that all attribute/object access subsequent to a completed transaction will load from the most recent database state.").
I'm not accessing some old model instance but rather issuing a completely new query every time. As far as I understand, this should lead to a direct query to the database rather than access to a cached instance. I can confirm that this is indeed the case if I turn the sqlalchemy debug log on (see the snippet below).
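For reference, I enable that debug output either via the engine's echo flag or through the standard logging module; a minimal sketch against the wrapper class above:
self.engine = create_engine(db_url, echo=True)  # log every emitted SQL statement

# or, equivalently, without touching the engine:
import logging
logging.basicConfig()
logging.getLogger("sqlalchemy.engine").setLevel(logging.INFO)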
The quick and dirty fix for this problem is to call db.session.expire_all() right after the transaction has begun, but this seems very inelegant and counter-intuitive. I would be very glad to understand what's wrong with the way I'm working with sqlalchemy here.
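For clarity, this is roughly what that workaround looks like dropped into loop() above; it is only a sketch of the band-aid I describe, not something I consider a proper solution:
def loop(db):
    while True:
        with db.session.begin():
            # Quick and dirty: mark everything in the identity map as expired,
            # so the next attribute access reloads state from the database.
            db.session.expire_all()
            t = db.session.query(TestModel).with_for_update().get(1)
            if t is None:
                t = TestModel(id=1, field=0)
                db.session.add(t)
                db.session.flush()
            t.field += 1
        sleep(2.0)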
I ran into a very similar situation with MySQL. I needed to "see" changes to the table that were coming from external sources in the middle of my code's database operations. I ended up having to set autocommit=True in my session call and use the begin() / commit() methods of the session to "see" data that was updated externally.
The SQLAlchemy docs say this is a legacy configuration:
Warning
“autocommit” mode is a legacy mode of use and should not be considered for new projects.
but also say in the next paragraph:
Modern usage of “autocommit mode” tends to be for framework integrations that wish to control specifically when the “begin” state occurs
So it doesn't seem to be clear which statement is correct.
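For what it's worth, the pattern I ended up with looked roughly like the sketch below (only a sketch, reusing the engine and TestModel from the question; the rest of the identifiers are illustrative):
from sqlalchemy.orm import sessionmaker

# Legacy "autocommit" mode: no transaction is held open between explicit
# begin()/commit() pairs, so each begin() sees externally committed data.
Session = sessionmaker(bind=engine, autocommit=True, expire_on_commit=True)
session = Session()

session.begin()                      # explicitly start a transaction
row = session.query(TestModel).get(1)
row.field += 1
session.commit()                     # next begin() will see externally committed data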
Related
I have a python script that handles data transactions through sqlalchemy using:
def save_update(args):
session, engine = create_session(config["DATABASE"])
try:
instance = get_record(session)
if instance is None:
instance = create_record(session)
else:
instance = update_record(session, instance)
sync_errors(session, instance)
sync_expressions(session, instance)
sync_part(session, instance)
session.commit()
except:
session.rollback()
write_error(config)
raise
finally:
session.close()
On top of the data transactions, I also have some processing unrelated to the database - data preparation before I can do my data transactions. Those prerequisite tasks take some time, so I wanted to execute multiple instances of this full script in parallel (data preparation + data transactions with sqlalchemy).
I am thus doing the following in a different script (simplified example here):
process1 = Thread(target=call_script, args=[["python", python_file_path,
"-xml", xml_path,
"-o", args.outputFolder,
"-l", log_path]])
process2 = Thread(target=call_script, args=[["python", python_file_path,
"-xml", xml_path,
"-o", args.outputFolder,
"-l", log_path]])
process1.start()
process2.start()
process1.join()
process2.join()
The target function call_script executes the first script mentioned above (data preparation + data transactions with sqlalchemy):
def call_script(args):
status = subprocess.call(args, shell=True)
print(status)
So, to summarize, I will for instance have 2 sub-threads plus the main one running. Each of those sub-threads executes sqlalchemy code in a separate process.
My question thus is: should I be taking any specific considerations into account regarding the multiprocessing side of my code with sqlalchemy? To me the answer is no, as this is multiprocessing rather than multithreading, given that I use subprocess.call() to execute my code.
Now in reality, from time to time I seem to hit database locks during execution. I am not sure whether this is related to my code or whether someone else is hitting the database while I am processing it, but I was expecting each subprocess to lock the database when starting its work, so that the other subprocesses would wait for the current session to close.
EDIT
I have replaced multithreading with multiprocessing for testing:
processes = [subprocess.Popen(cmd[0], shell=True) for cmd in commands]
I still have the same issue, on which I have more details:
I see SQL Server showing status "AWAITING COMMAND", and this only goes away when I kill the related python process executing the command.
It seems to appear when I parallelize the subprocesses heavily, but I am really not sure.
Thanks in advance for any support.
This is an interesting situation. It seems that maybe you can sidestep some of the manual process/thread handling and utilize something like multiprocessing's Pool. I made an example based on some other data-initializing code I had. It delegates creating test data and inserting it for each of 10 "devices" to a pool of 3 processes. One caveat that seems necessary is to dispose of the engine before it is shared across fork(), i.e. before the Pool tasks are created; this is mentioned here: engine-disposal
from random import randint
from datetime import datetime
from multiprocessing import Pool
from sqlalchemy import (
create_engine,
Integer,
DateTime,
String,
)
from sqlalchemy.schema import (
Column,
MetaData,
ForeignKey,
)
from sqlalchemy.orm import declarative_base, relationship, Session, backref
db_uri = 'postgresql+psycopg2://username:password@/database'
engine = create_engine(db_uri, echo=False)
metadata = MetaData()
Base = declarative_base(metadata=metadata)
class Event(Base):
__tablename__ = "events"
id = Column(Integer, primary_key=True, index=True)
created_on = Column(DateTime, nullable=False, index=True)
device_id = Column(Integer, ForeignKey('devices.id'), nullable=True)
device = relationship('Device', backref=backref("events"))
class Device(Base):
__tablename__ = "devices"
id = Column(Integer, primary_key=True, autoincrement=True)
name = Column(String(50))
def get_test_data(device_num):
""" Generate a test device and its test events for the given device number. """
device_dict = dict(name=f'device-{device_num}')
event_dicts = []
for day in range(1, 5):
for hour in range(0, 24):
for _ in range(0, randint(0, 50)):
event_dicts.append({
"created_on": datetime(day=day, month=1, year=2022, hour=hour),
})
return (device_dict, event_dicts)
def create_test_data(device_num):
""" Actually write the test data to the database. """
device_dict, event_dicts = get_test_data(device_num)
print (f"creating test data for {device_dict['name']}")
with Session(engine) as session:
device = Device(**device_dict)
session.add(device)
session.flush()
events = [Event(**event_dict) for event_dict in event_dicts]
event_count = len(events)
device.events.extend(events)
session.add_all(events)
session.commit()
return event_count
if __name__ == '__main__':
metadata.create_all(engine)
# Throw this away before fork.
engine.dispose()
# I have a 4-core processor, so I chose 3.
with Pool(3) as p:
print (p.map(create_test_data, range(0, 10)))
# Accessing engine here should still work
# but a new connection will be created.
with Session(engine) as session:
print (session.query(Event).count())
Output
creating test data for device-1
creating test data for device-0
creating test data for device-2
creating test data for device-3
creating test data for device-4
creating test data for device-5
creating test data for device-6
creating test data for device-7
creating test data for device-8
creating test data for device-9
[2511, 2247, 2436, 2106, 2244, 2464, 2358, 2512, 2267, 2451]
23596
I am answering my question as it did not relate to SQLAlchemy at all in the end.
When executing:
processes = [subprocess.Popen(cmd[0], shell=True) for cmd in commands]
On a specific batch, and for no obvious reason, one of the subprocesses was not exiting properly, although the script it called reached its end.
I searched and saw that this is a known issue when using p.wait() with Popen and shell=True.
I set shell=False, used pipes for stdout and stderr, and also added a sys.exit(0) at the end of the python script being executed by the subprocess to make sure it terminates execution properly.
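Concretely, the change amounted to roughly this (a sketch; commands stands for the list of argument lists such as ["python", python_file_path, "-xml", xml_path]):
import subprocess

# shell=False takes the argument list directly; stdout/stderr go to pipes.
processes = [
    subprocess.Popen(cmd, shell=False,
                     stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    for cmd in commands
]
for p in processes:
    out, err = p.communicate()  # waits for the child and drains both pipes
    print(p.returncode)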
Hope it can help someone else! Thanks Ian for your support.
I have 1,000,000 records that I am trying to enter into the database; unfortunately, some of the records do not conform to the db schema. At the moment, when a record fails, I:
roll back the session
observe the exception
fix the issue
run again.
I wish to build a script which would set aside all the "bad" records but commit all the correct ones.
Of course I could commit one record at a time and, when a commit fails, roll back and commit the next, but I would pay a "run time price" as the code would run for a long time.
In the example below I have two models, File and Client, with a one-to-many relation: one client has many files.
In the commit.py file I wish to commit the 1M File objects at once, or in batches of 1k. At the moment I only find out that something failed when I commit at the end. Is there a way to know which objects are "bad" (integrity errors with the foreign key, for example) beforehand, i.e. to park them aside (in another list) while committing all the "good" ones?
Thanks a lot for the help.
#model.py
from sqlalchemy import Column, DateTime, String, func, Integer, ForeignKey
from . import base
class Client(base):
__tablename__ = 'clients'
id = Column(String, primary_key=True)
class File(base):
__tablename__ = 'files'
id = Column(Integer, primary_key=True, autoincrement=True)
client_id = Column(String, ForeignKey('clients.id'))
path = Column(String)
#init.py
import os
from dotenv import load_dotenv
from sqlalchemy.orm import sessionmaker
from sqlalchemy import create_engine
from sqlalchemy.ext.declarative import declarative_base
load_dotenv()
db_name = os.getenv("DB_NAME")
db_host = os.getenv("DB_HOST")
db_port = os.getenv("DB_PORT")
db_user = os.getenv("DB_USER")
db_password = os.getenv("DB_PASSWORD")
db_uri = 'postgresql://' + db_user + ':' + db_password + '@' + db_host + ':' + db_port + '/' + db_name
print(f"product_domain: {db_uri}")
base = declarative_base()
engine = create_engine(db_uri)
base.metadata.bind = engine
Session = sessionmaker(bind=engine)
session = Session()
conn = engine.connect()
#commit.py
from . import session
def commit(list_of_1m_File_objects_records):
    # I wish to loop over the rows and, if a specific row raises an exception, put it in a list and handle it afterwards
for file in list_of_1m_File_objects_records:
session.add(file)
session.commit()
# client:
# id
# "a"
# "b"
# "c"
# file:
# id|client_id|path
# --|---------|-------------
# 1 "a" "path1.txt"
# 2 "aa" "path2.txt"
# 3 "c" "path143.txt"
# 4 "a" "pat.txt"
# 5 "b" "patb.txt"
# I wish the file data would enter the database even though it has one record ("aa") which will raise an integrity error
Since I can't comment, I would suggest using psycopg2 and SQLAlchemy to create the connection to the db and then using a query with ON CONFLICT at the end to add and commit your data.
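A rough sketch of that idea with SQLAlchemy's PostgreSQL dialect, assuming the File model and session from the question (note that ON CONFLICT only covers unique/exclusion constraints, so it would not swallow the foreign-key error in the example):
from sqlalchemy.dialects.postgresql import insert

def bulk_insert_skip_conflicts(session, file_dicts):
    # file_dicts: plain dicts such as {"client_id": "a", "path": "path1.txt"}
    stmt = insert(File.__table__).on_conflict_do_nothing()
    session.execute(stmt, file_dicts)
    session.commit()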
Of course I can commit one by one and then when the commit fail rollback and commit the next but I would pay a "run time price" as the code would run for a long time.
What is the source of that price? If it is fsync speed, you can get rid of most of that cost by setting synchronous_commit to off on the local connection. If you have a crash part way through, then you need to figure out which ones had actually been recorded once it comes back up so you know where to start up again, but I wouldn't think that that would be hard to do. This method should get you most benefit for the least work.
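If you go that route, turning it off for just the session's connection is a one-liner (sketch below; plain SET affects the whole connection, SET LOCAL only the current transaction):
from sqlalchemy import text

# A crash may lose the last few "committed" transactions, but cannot corrupt data.
session.execute(text("SET synchronous_commit TO OFF"))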
At the moment I only find out that something failed when I commit at the end
It sounds like you are using deferred constraints. Is there a reason for that?
is there a way to know which objects are "bad" (integrity errors with the foreign key, for example)
For the case of that example, read all the Client ids into a dictionary before you start (assuming they will fit in RAM), then test the Files on the python side so you can reject the orphans before trying to insert them.
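A minimal sketch of that pre-check, assuming the Client and File models and the session from the question:
# Load all existing client ids once (a set is enough; use a dict if you also
# need the Client rows themselves).
known_clients = {row.id for row in session.query(Client.id)}

good, orphans = [], []
for f in list_of_1m_File_objects_records:
    (good if f.client_id in known_clients else orphans).append(f)

session.add_all(good)
session.commit()
# `orphans` can now be logged or repaired separately.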
Summary
I'm trying to write integration tests against a series of database operations, and I want to be able to use a SQLAlchemy session as a staging environment in which to validate and roll back a transaction.
Is it possible to retrieve uncommitted data using session.query(Foo) instead of session.execute(text('select * from foo'))?
Background and Research
These results were observed using SQLAlchemy 1.2.10, Python 2.7.13, and Postgres 9.6.11.
I've looked at related StackOverflow posts but haven't found an explanation as to why the two operations below should behave differently.
SQLalchemy: changes not committing to db
Tried with and without session.flush() before every session.query. No success.
sqlalchemy update not commiting changes to database. Using single connection in an app
Checked to make sure I am using the same session object throughout
Sqlalchemy returns different results of the SELECT command (query.all)
N/A: My target workflow is to assess a series of CRUD operations within the staging tables of a single session.
Querying objects added to a non committed session in SQLAlchemy
Seems to be the most related issue, but my motivation for avoiding session.commit() is different, and I didn't quite find the explanation I'm looking for.
Reproducible Example
1) I establish a connection to the database and define a model object; no issues so far:
from sqlalchemy import text
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Integer, String, ForeignKey
#####
# Prior DB setup:
# CREATE TABLE foo (id int PRIMARY KEY, label text);
#####
# from https://docs.sqlalchemy.org/en/13/orm/mapping_styles.html#declarative-mapping
Base = declarative_base()
class Foo(Base):
__tablename__ = 'foo'
id = Column(Integer, primary_key=True)
label = Column(String)
# from https://docs.sqlalchemy.org/en/13/orm/session_basics.html#getting-a-session
some_engine = create_engine('postgresql://username:password@endpoint/database')
Session = sessionmaker(bind=some_engine)
2) I perform some updates without committing the result, and I can see the staged data by executing a select statement within the session:
session = Session()
sql_insert = text("INSERT INTO foo (id, label) VALUES (1, 'original')")
session.execute(sql_insert);
sql_read = text("SELECT * FROM foo WHERE id = 1");
res = session.execute(sql_read).first()
print res.label
sql_update = text("UPDATE foo SET label = 'updated' WHERE id = 1")
session.execute(sql_update)
res2 = session.execute(sql_read).first()
print res2.label
sql_update2 = text("""
INSERT INTO foo (id, label) VALUES (1, 'second_update')
ON CONFLICT (id) DO UPDATE
SET (label) = (EXCLUDED.label)
""")
session.execute(sql_update2)
res3 = session.execute(sql_read).first()
print res3.label
session.rollback()
# prints expected values: 'original', 'updated', 'second_update'
3) I attempt to replace select statements with session.query, but I can't see the new data:
session = Session()
sql_insert = text("INSERT INTO foo (id, label) VALUES (1, 'original')")
session.execute(sql_insert);
res = session.query(Foo).filter_by(id=1).first()
print res.label
sql_update = text("UPDATE foo SET label = 'updated' WHERE id = 1")
session.execute(sql_update)
res2 = session.query(Foo).filter_by(id=1).first()
print res2.label
sql_update2 = text("""
INSERT INTO foo (id, label) VALUES (1, 'second_update')
ON CONFLICT (id) DO UPDATE
SET (label) = (EXCLUDED.label)
""")
session.execute(sql_update2)
res3 = session.query(Foo).filter_by(id=1).first()
print res3.label
session.rollback()
# prints: 'original', 'original', 'original'
I expect the printed output of Step 3 to be 'original', 'updated', 'second_update'.
The root cause is that the raw SQL queries and the ORM do not mix automatically in this case. While the Session is not a cache, meaning it does not cache queries, it does store objects based on their primary key in the identity map. When a Query returns a row for a mapped object, the existing object is returned. This is why you do not observe the changes you made in the 3rd step. This might seem like a rather poor way to handle the situation, but SQLAlchemy is operating based on some assumptions about transaction isolation, as described in "When to Expire or Refresh":
Transaction Isolation
...[So] as a best guess, it assumes that within the scope of a transaction, unless it is known that a SQL expression has been emitted to modify a particular row, there’s no need to refresh a row unless explicitly told to do so.
The whole note about transaction isolation is a worthwhile read. The way to make such changes known to SQLAlchemy is to perform updates using the Query API, if possible, and to manually expire changed objects, if all else fails. With this in mind, your 3rd step could look like:
session = Session()
sql_insert = text("INSERT INTO foo (id, label) VALUES (1, 'original')")
session.execute(sql_insert);
res = session.query(Foo).filter_by(id=1).first()
print(res.label)
session.query(Foo).filter_by(id=1).update({Foo.label: 'updated'},
synchronize_session='fetch')
# This query is actually redundant, `res` and `res2` are the same object
res2 = session.query(Foo).filter_by(id=1).first()
print(res2.label)
sql_update2 = text("""
INSERT INTO foo (id, label) VALUES (1, 'second_update')
ON CONFLICT (id) DO UPDATE
SET label = EXCLUDED.label
""")
session.execute(sql_update2)
session.expire(res)
# Again, this query is redundant and fetches the same object that needs
# refreshing anyway
res3 = session.query(Foo).filter_by(id=1).first()
print(res3.label)
session.rollback()
I have a Pyramid / SQLAlchemy, MySQL python app.
When I execute a raw SQL INSERT query, nothing gets written to the DB.
When using ORM, however, I can write to the DB. I read the docs, I read up about the ZopeTransactionExtension, read a good deal of SO questions, all to no avail.
What hasn't worked so far:
transaction.commit() - nothing is written to the DB. I do realize this statement is necessary with ZopeTransactionExtension but it just doesn't do the magic here.
dbsession().commit() - doesn't work since I'm using ZopeTransactionExtension
dbsession().close() - nothing written
dbsession().flush() - nothing written
mark_changed(session) -
File "/home/dev/.virtualenvs/sc/local/lib/python2.7/site-packages/zope/sqlalchemy/datamanager.py", line 198, in join_transaction
if session.twophase:
AttributeError: 'scoped_session' object has no attribute 'twophase'
What has worked but is not acceptable because it doesn't use scoped_session:
engine.execute(...)
I'm looking for how to execute raw SQL with a scoped_session (dbsession() in my code)
Here is my SQLAlchemy setup (models/__init__.py)
def dbsession():
assert (_dbsession is not None)
return _dbsession
def init_engines(settings, _testing_workarounds=False):
import zope.sqlalchemy
extension = zope.sqlalchemy.ZopeTransactionExtension()
global _dbsession
_dbsession = scoped_session(
sessionmaker(
autoflush=True,
expire_on_commit=False,
extension=extension,
)
)
engine = engine_from_config(settings, 'sqlalchemy.')
_dbsession.configure(bind=engine)
Here is a python script I wrote to isolate the problem. It resembles the real-world environment where the problem occurs. All I want is to make the script below insert the data into the DB:
# -*- coding: utf-8 -*-
import sys
import transaction
from pyramid.paster import setup_logging, get_appsettings
from sc.models import init_engines, dbsession
from sqlalchemy.sql.expression import text
def __main__():
if len(sys.argv) < 2:
raise RuntimeError()
config_uri = sys.argv[1]
setup_logging(config_uri)
aa = init_engines(get_appsettings(config_uri))
session = dbsession()
session.execute(text("""INSERT INTO
operations (description, generated_description)
VALUES ('hello2', 'world');"""))
print list(session.execute("""SELECT * from operations""").fetchall()) # prints inserted data
transaction.commit()
print list(session.execute("""SELECT * from operations""").fetchall()) # doesn't print inserted data
if __name__ == '__main__':
__main__()
What is interesting is that if I do:
session = dbsession()
session.execute(text("""INSERT INTO
operations (description, generated_description)
VALUES ('hello2', 'world');"""))
op = Operation(generated_description='aa', description='oo')
session.add(op)
then the first print outputs the raw-SQL-inserted row ('hello2', 'world'), the second print prints both rows, and in fact both rows are inserted into the DB.
I cannot comprehend why using an ORM insert alongside raw SQL "fixes" it.
I really need to be able to call execute() on a scoped_session to insert data into the DB using raw SQL. Any advice?
It has been a while since I mixed raw sql with sqlalchemy, but whenever you mix them, you need to be aware of what happens behind the scenes with the ORM. First, check the autocommit flag. If the zope transaction is not configured correctly, the ORM insert might be triggering a commit.
Actually, after looking at the zope docs, it seems manual execute statements need an extra step. From their readme:
By default, zope.sqlalchemy puts sessions in an 'active' state when they are
first used. ORM write operations automatically move the session into a
'changed' state. This avoids unnecessary database commits. Sometimes it
is necessary to interact with the database directly through SQL. It is not
possible to guess whether such an operation is a read or a write. Therefore we
must manually mark the session as changed when manual SQL statements write
to the DB.
>>> session = Session()
>>> conn = session.connection()
>>> users = Base.metadata.tables['test_users']
>>> conn.execute(users.update(users.c.name=='bob'), name='ben')
<sqlalchemy.engine...ResultProxy object at ...>
>>> from zope.sqlalchemy import mark_changed
>>> mark_changed(session)
>>> transaction.commit()
>>> session = Session()
>>> str(session.query(User).all()[0].name)
'ben'
>>> transaction.abort()
It seems you aren't doing that, and so the transaction.commit does nothing.
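Applied to your isolation script, that would look roughly like the sketch below. The session() call is my assumption based on your traceback: mark_changed appears to have been handed the scoped_session proxy, while it expects the underlying Session, which calling the proxy returns:
from zope.sqlalchemy import mark_changed

session = dbsession()
session.execute(text("""INSERT INTO
    operations (description, generated_description)
    VALUES ('hello2', 'world');"""))

# Mark the session as changed so the zope transaction actually commits the
# raw SQL write; pass the real Session obtained by calling the scoped_session.
mark_changed(session())

transaction.commit()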
I have the following function in python:
def add_odm_object(obj, table_name, primary_key, unique_column):
    db = create_engine('mysql+pymysql://root:@127.0.0.1/mydb')
metadata = MetaData(db)
t = Table(table_name, metadata, autoload=True)
s = t.select(t.c[unique_column] == obj[unique_column])
rs = s.execute()
r = rs.fetchone()
if not r:
i = t.insert()
i_res = i.execute(obj)
v_id = i_res.inserted_primary_key[0]
return v_id
else:
return r[primary_key]
This function checks whether the object obj is in the database and, if it is not found, saves it to the DB. Now, I have a problem. I call the above function in a loop many times, and after a few hundred iterations I get an error: user root has exceeded the max_user_connections resource (current value: 30). I tried to search for answers; for example, the question How to close sqlalchemy connection in MySQL recommends creating a conn = db.connect() object, where db is the engine, and calling conn.close() after my query is completed.
But where should I open and close the connection in my code? I am not working with the connection directly; I'm using the Table() and MetaData constructs instead.
The engine is an expensive-to-create factory for database connections. Your application should call create_engine() exactly once per database server.
Similarly, the MetaData and Table objects describe a fixed schema object within a known database. These are also configurational constructs that in most cases are created once, just like classes, in a module.
In this case, your function seems to want to load up tables dynamically, which is fine; the MetaData object acts as a registry, which has the convenience feature that it will give you back an existing table if it already exists.
Within a Python function and especially within a loop, for best performance you typically want to refer to a single database connection only.
Taking these things into account, your module might look like:
# module level variable. can be initialized later,
# but generally just want to create this once.
db = create_engine('mysql+pymysql://root:@127.0.0.1/mydb')
# module level MetaData collection.
metadata = MetaData()
def add_odm_object(obj, table_name, primary_key, unique_column):
with db.begin() as connection:
# will load table_name exactly once, then store it persistently
# within the above MetaData
        t = Table(table_name, metadata, autoload=True, autoload_with=connection)
s = t.select(t.c[unique_column] == obj[unique_column])
rs = connection.execute(s)
r = rs.fetchone()
if not r:
            i_res = connection.execute(t.insert(), obj)
v_id = i_res.inserted_primary_key[0]
return v_id
else:
return r[primary_key]