I have a Python script that handles data transactions through SQLAlchemy, using:
def save_update(args):
    session, engine = create_session(config["DATABASE"])
    try:
        instance = get_record(session)
        if instance is None:
            instance = create_record(session)
        else:
            instance = update_record(session, instance)
        sync_errors(session, instance)
        sync_expressions(session, instance)
        sync_part(session, instance)
        session.commit()
    except:
        session.rollback()
        write_error(config)
        raise
    finally:
        session.close()
On top of the data transactions, I also have some processing unrelated to the database: data preparation that must happen before I can run the transactions. Those prerequisite tasks take some time, so I wanted to execute multiple instances of the full script (data preparation + data transactions with SQLAlchemy) in parallel.
I am thus doing the following in a different script (simplified example):
process1 = Thread(target=call_script, args=[["python", python_file_path,
                                             "-xml", xml_path,
                                             "-o", args.outputFolder,
                                             "-l", log_path]])
process2 = Thread(target=call_script, args=[["python", python_file_path,
                                             "-xml", xml_path,
                                             "-o", args.outputFolder,
                                             "-l", log_path]])
process1.start()
process2.start()
process1.join()
process2.join()
The target function call_script executes the first script mentioned above (data preparation + data transactions with SQLAlchemy):
def call_script(args):
    status = subprocess.call(args, shell=True)
    print(status)
So to summarize, I will for instance have 2 sub-threads plus the main one running. Each of those sub-threads executes the SQLAlchemy code in a separate process.
My question is: should I be taking care of any specific considerations regarding the multiprocessing side of my code with SQLAlchemy? My own answer would be no, since this is multiprocessing rather than multithreading, given that I use subprocess.call() to execute my code.
In reality, from time to time I seem to hit database locks during execution. I am not sure whether this is related to my code or whether someone else is hitting the database while I am processing it, but I was expecting each subprocess to lock the database when it starts its work, so that the other subprocesses would wait for the current session to close.
EDIT
I have replaced multithreading with multiprocessing for testing:
processes = [subprocess.Popen(cmd[0], shell=True) for cmd in commands]
I still have the same issue, on which I have more details:
I see SQL Server showing the status "AWAITING COMMAND", and this only goes away when I kill the related Python process executing the command.
It seems to appear when I parallelize the subprocesses heavily, but I am really not sure.
Thanks in advance for any support.
This is an interesting situation. It seems that maybe you can sidestep some of the manual process/thread handling and utilize something like multiprocessing's Pool. I made an example based on some other data-initializing code I had. This delegates creating test data and inserting it for each of 10 "devices" to a pool of 3 processes. One caveat that seems necessary is to dispose of the engine before it is shared across fork(), i.e. before the Pool tasks are created; this is mentioned here: engine-disposal
from random import randint
from datetime import datetime
from multiprocessing import Pool

from sqlalchemy import (
    create_engine,
    Integer,
    DateTime,
    String,
)
from sqlalchemy.schema import (
    Column,
    MetaData,
    ForeignKey,
)
from sqlalchemy.orm import declarative_base, relationship, Session, backref

db_uri = 'postgresql+psycopg2://username:password@/database'
engine = create_engine(db_uri, echo=False)
metadata = MetaData()
Base = declarative_base(metadata=metadata)

class Event(Base):
    __tablename__ = "events"
    id = Column(Integer, primary_key=True, index=True)
    created_on = Column(DateTime, nullable=False, index=True)
    device_id = Column(Integer, ForeignKey('devices.id'), nullable=True)
    device = relationship('Device', backref=backref("events"))

class Device(Base):
    __tablename__ = "devices"
    id = Column(Integer, primary_key=True, autoincrement=True)
    name = Column(String(50))

def get_test_data(device_num):
    """ Generate a test device and its test events for the given device number. """
    device_dict = dict(name=f'device-{device_num}')
    event_dicts = []
    for day in range(1, 5):
        for hour in range(0, 24):
            for _ in range(0, randint(0, 50)):
                event_dicts.append({
                    "created_on": datetime(day=day, month=1, year=2022, hour=hour),
                })
    return (device_dict, event_dicts)

def create_test_data(device_num):
    """ Actually write the test data to the database. """
    device_dict, event_dicts = get_test_data(device_num)
    print(f"creating test data for {device_dict['name']}")
    with Session(engine) as session:
        device = Device(**device_dict)
        session.add(device)
        session.flush()
        events = [Event(**event_dict) for event_dict in event_dicts]
        event_count = len(events)
        device.events.extend(events)
        session.add_all(events)
        session.commit()
    return event_count

if __name__ == '__main__':
    metadata.create_all(engine)
    # Throw this away before fork.
    engine.dispose()
    # I have a 4-core processor, so I chose 3.
    with Pool(3) as p:
        print(p.map(create_test_data, range(0, 10)))
    # Accessing engine here should still work
    # but a new connection will be created.
    with Session(engine) as session:
        print(session.query(Event).count())
Output
creating test data for device-1
creating test data for device-0
creating test data for device-2
creating test data for device-3
creating test data for device-4
creating test data for device-5
creating test data for device-6
creating test data for device-7
creating test data for device-8
creating test data for device-9
[2511, 2247, 2436, 2106, 2244, 2464, 2358, 2512, 2267, 2451]
23596
I am answering my own question, as it did not relate to SQLAlchemy at all in the end.
When executing:
processes = [subprocess.Popen(cmd[0], shell=True) for cmd in commands]
On a specific batch, and for no obvious reason, one of the subprocesses was not exiting properly even though the script it called reached its end.
I searched and found that this is a known issue when using p.wait() with Popen and shell=True.
I set shell=False, used pipes for stdout and stderr, and also added a sys.exit(0) at the end of the Python script being executed by the subprocess to make sure it terminates execution properly.
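For reference, a minimal sketch of what the fixed launch code could look like (the command lists below are placeholders standing in for the real script arguments):
import subprocess

commands = [
    ["python", "script.py", "-xml", "input1.xml", "-o", "out", "-l", "run1.log"],
    ["python", "script.py", "-xml", "input2.xml", "-o", "out", "-l", "run2.log"],
]

# shell=False, with pipes for stdout/stderr.
processes = [
    subprocess.Popen(cmd, shell=False, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    for cmd in commands
]

for p in processes:
    stdout, stderr = p.communicate()  # waits for the process and drains the pipes
    print(p.returncode)
Using communicate() instead of a bare wait() also avoids the deadlock that can occur when a child fills its pipe buffers.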
Hope it can help someone else! Thanks Ian for your support.
At the moment, I am using Session context managers ('with' blocks) in two scripts running in two Docker containers, connected to a third MySQL Docker container. The scripts run at the same time, polling for updates.
There are multiple of these in both scripts:
with Session(bind=engine, expire_on_commit=True) as session:
    with session.begin():
        check_for_sheet_updates(session)
I'm not doing any session.commit()s, only session.add()s.
It's such weird behaviour because one row gets added fine, and that row remains in the database. However, when I add another row, the second row gets removed automatically after 4-5 seconds. It shows up when I run SELECT * FROM sheet_instance; for about 4-5 seconds, then vanishes.
I have tried session.flush() before and after the adds. Nothing changes.
A new session is made in the first script whenever I make a Telegram request to the script, and it closes when the process is finished, via the 'with' statement.
No errors are raised when session.add() is called on the second row.
A new session is made in the second script every 3-10 seconds (which could be the main cause).
I have tried everything I can. It's pretty obvious I'm not sure how sessions work. If someone could help me, it would be appreciated.
There is also a third file that I import the database engine from:
from email.policy import default
from sqlalchemy.orm import sessionmaker, Session, declarative_base, relationship
from sqlalchemy import create_engine, MetaData, Table, Column, Integer, String, Boolean, inspect
from sqlalchemy_utils import database_exists, create_database
import os
import time
import datetime

SQL_BINANCE_USERNAME = os.getenv('SQL_BINANCE_USERNAME')
SQL_BINANCE_PASSWORD = os.getenv('SQL_BINANCE_PASSWORD')

mysql_conn_str = f"mysql+pymysql://{SQL_BINANCE_USERNAME}:{SQL_BINANCE_PASSWORD}@binance_bot_creator_net:3306"
engine = create_engine(mysql_conn_str)
Base = declarative_base()
connection = engine.connect()
connection.execute("CREATE DATABASE IF NOT EXISTS instances")
connection.execute("commit")

mysql_conn_str = f"{mysql_conn_str}/instances?charset=utf8mb4"
engine = create_engine(mysql_conn_str, pool_pre_ping=True, pool_size=10, max_overflow=15, pool_timeout=30)
Base = declarative_base()
connection = engine.connect()
sessionMade = sessionmaker(bind=engine)

class Sheet_Instance(Base):
    __tablename__ = "sheet_instance"
    id = Column(Integer, primary_key=True)
    api_key = Column(String(128))
    api_secret = Column(String(128))
    gid = Column(String(128))
    sheet_name = Column(String(128))
    sheet_name_lower = Column(String(128))
    symbol = Column(String(128))
    active = Column(Boolean(), default=False)
    notification_chat_id = Column(String(128), default="-1001768606486")

Base.metadata.create_all(engine)
In the two script files, I import engine and Sheet_Instance from the file above.
I'm not sure what to do.
I have 1,000,000 records that I am trying to insert into the database; unfortunately, some of the records do not conform to the db schema. At the moment, when a record fails, I do the following:
roll back the transaction
observe the exception
fix the issue
run again
I wish to build a script that would set aside all the "bad" records but commit all the correct ones.
Of course I can commit one by one, and when a commit fails, roll back and commit the next one, but I would pay a "run time price" as the code would run for a long time.
In the example below I have two models, File and Client, with a one-to-many relation: one client has many files.
In the commit.py file I wish to commit the 1M File objects at once, or in batches (of 1k). At the moment I only find out that something failed when I commit at the end. Is there a way to know beforehand which objects are "bad" (integrity errors with the foreign key, for example), i.e. park them aside (in another list) while committing all the "good" ones?
Thanks a lot for the help.
# model.py
from sqlalchemy import Column, DateTime, String, func, Integer, ForeignKey
from . import base

class Client(base):
    __tablename__ = 'clients'
    id = Column(String, primary_key=True)

class File(base):
    __tablename__ = 'files'
    id = Column(Integer, primary_key=True, autoincrement=True)
    client_id = Column(String, ForeignKey('clients.id'))
    path = Column(String)

# init.py
import os
from dotenv import load_dotenv
from sqlalchemy.orm import sessionmaker
from sqlalchemy import create_engine
from sqlalchemy.ext.declarative import declarative_base

load_dotenv()
db_name = os.getenv("DB_NAME")
db_host = os.getenv("DB_HOST")
db_port = os.getenv("DB_PORT")
db_user = os.getenv("DB_USER")
db_password = os.getenv("DB_PASSWORD")
db_uri = 'postgresql://' + db_user + ':' + db_password + '@' + db_host + ':' + db_port + '/' + db_name
print(f"product_domain: {db_uri}")
base = declarative_base()
engine = create_engine(db_uri)
base.metadata.bind = engine
Session = sessionmaker(bind=engine)
session = Session()
conn = engine.connect()
# commit.py
from . import session

def commit(list_of_1m_File_objects_records):
    # I wish to loop over the rows and, if a specific row raises an exception, insert it into a list and handle it afterwards
    for file in list_of_1m_File_objects_records:
        session.add(file)
    session.commit()
# client:
# id
# "a"
# "b"
# "c"
# file:
# id | client_id | path
# ---|-----------|--------------
# 1  | "a"       | "path1.txt"
# 2  | "aa"      | "path2.txt"
# 3  | "c"       | "path143.txt"
# 4  | "a"       | "pat.txt"
# 5  | "b"       | "patb.txt"
# I wish the file data would enter the database even though it has one record ("aa") that will raise an integrity error
Since I can't comment, I would suggest using psycopg2 and SQLAlchemy to create the connection to the db, and then using a query with "ON CONFLICT" at the end to add and commit your data.
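A minimal sketch of that suggestion, assuming the File model and session from the question and SQLAlchemy's PostgreSQL insert construct (note that ON CONFLICT covers conflicts on the constraint you name, such as duplicate primary keys; it does not suppress foreign-key violations):
from sqlalchemy.dialects.postgresql import insert

def insert_ignoring_conflicts(session, file_dicts):
    # file_dicts is a list of plain dicts, e.g. {"client_id": "a", "path": "path1.txt"}
    stmt = insert(File.__table__).values(file_dicts)
    # Skip rows that collide with an existing id instead of failing the whole batch.
    stmt = stmt.on_conflict_do_nothing(index_elements=["id"])
    session.execute(stmt)
    session.commit()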
Of course I can commit one by one and then when the commit fail rollback and commit the next but I would pay a "run time price" as the code would run for a long time.
What is the source of that price? If it is fsync speed, you can get rid of most of that cost by setting synchronous_commit to off on the local connection. If you have a crash partway through, then you need to figure out which ones had actually been recorded once it comes back up so you know where to start again, but I wouldn't think that would be hard to do. This method should get you the most benefit for the least work.
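A minimal sketch of that idea, combining per-record commits with synchronous_commit turned off for each transaction (this assumes the session from the question and a PostgreSQL backend; SET LOCAL only affects the current transaction):
from sqlalchemy import text

def commit_one_by_one(session, records):
    bad = []
    for record in records:
        try:
            # Skip the fsync wait for this transaction only.
            session.execute(text("SET LOCAL synchronous_commit TO OFF"))
            session.add(record)
            session.commit()
        except Exception:
            session.rollback()
            bad.append(record)  # park the record that failed
    return bad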
at the moment I only understand when something failed when i commit at the end
It sounds like you are using deferred constraints. Is there a reason for that?
is there a way to know which object are "bad" ( Integrity errors with the foreign key as example)
For the case of that example, read all the Client ids into a dictionary before you start (assuming they will fit in RAM), then test the Files on the Python side so you can reject the orphans before trying to insert them.
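A minimal sketch of that pre-check, assuming the Client/File models and the session from the question:
def split_orphans(session, file_objects):
    # Load every existing client id once; a set is enough for membership tests.
    known_client_ids = {row.id for row in session.query(Client.id)}
    good, bad = [], []
    for f in file_objects:
        (good if f.client_id in known_client_ids else bad).append(f)
    return good, bad

good_files, bad_files = split_orphans(session, list_of_1m_File_objects_records)
session.add_all(good_files)
session.commit()  # bad_files can be logged or fixed separately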
I need to run an embarrassingly parallel fetch job for thousands of SQL queries against the database.
Here is the simplified example.
## Env info: python=3.7, postgresql=10, dask=latest

## Generate the example db table.
from sqlalchemy import create_engine
import pandas as pd
import numpy as np

engine = create_engine('postgresql://dbadmin:dbadmin@server:5432/db01')
data = pd.DataFrame(np.random.randint(0, 100, size=(30000, 5)), columns=['a', 'b', 'c', 'd', 'e'])
data.to_sql('tablename', engine, index=True, if_exists='append')
First, here is the basic example without dask parallelism.
from sqlalchemy import create_engine
import pandas as pd
import numpy as np

engine = create_engine('postgresql://dbadmin:dbadmin@server:5432/db01')

def job(indexstr):
    'send the query, fetch the data, do some calculation and return'
    sql = 'select * from public.tablename where index=' + indexstr
    df = pd.read_sql_query(sql, engine, index_col='index')
    # get the data and do some analysis.
    return np.sum(df.values)

lists = []
for v in range(1000):
    lists.append(job(str(v)))
### wall time: 17s
It's not as fast as we would imagine, since both the database query and the data analysis take time, and there are idle CPUs.
Then I try to use dask to parallelize it like this:
def jobWithEngine(indexstr):
    """The engine cannot be serialized between processes, thus each process creates its own."""
    engine = create_engine('postgresql://dbadmin:dbadmin@server:5432/db01')
    sql = 'select * from public.tablename where index=' + indexstr
    df = pd.read_sql_query(sql, engine, index_col='index')
    return np.sum(df.values)
import dask
dask.config.set(scheduler='processes')
import dask.bag as db

dbdata = db.from_sequence([str(v) for v in range(1000)])
dbdata = dbdata.map(lambda x: jobWithEngine(x))
results_bag = dbdata.compute()
### Wall time: 1min8s
The problem is that the engine creation takes extra time, and there are thousands of such creations.
The engine will be recreated for every SQL query, which is really costly and might crash the database service!
So I guess there must be a more elegant way, something like this:
import dask
dask.config.set(scheduler='processes')
import dask.bag as db

dbdata = db.from_sequence([str(v) for v in range(1000)])
dbdata = dbdata.map(lambda x: job(x, init=create_engine))
results_bag = dbdata.compute()
1. The main process creates 8 sub-processes.
2. Each sub-process creates its own engine as part of its initialization.
3. The main process then sends them 1000 jobs and gets the 1000 results back.
4. After all is done, the sub-process engines are disposed and the sub-processes are killed.
Or has dask already done this, and the additional time comes from communication between processes?
You can do this by setting a connected database as an attribute on each worker using get_worker:
from dask.distributed import get_worker

def connect_worker_db(db):
    worker = get_worker()
    worker.db = db  # DB settings, password, username etc
    worker.db.connect()  # Function that connects the database, e.g. create_engine()
Then have the client run the connect_worker_db:
from dask.distributed import Client, get_worker
client = Client()
client.run(connect_worker_db, db)
For the function using the connection, like jobWithEngine(), you have to get the worker and use the attribute you saved it to:
def jobWithEngine():
    db = get_worker().db
Then make sure to disconnect at the end:
def disconnect_worker_db():
    worker = get_worker()
    worker.db.disconnect()
client.run(disconnect_worker_db)
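Putting the pieces together, a minimal sketch of the whole pattern, under the assumption that worker.db holds a SQLAlchemy engine (the URI, table and query mirror the question and are placeholders):
from dask.distributed import Client, get_worker
from sqlalchemy import create_engine
import pandas as pd
import numpy as np

DB_URI = 'postgresql://dbadmin:dbadmin@server:5432/db01'

def connect_worker_db():
    # Create one engine per worker process and keep it on the worker object.
    get_worker().db = create_engine(DB_URI)

def disconnect_worker_db():
    get_worker().db.dispose()

def job_with_worker_engine(indexstr):
    engine = get_worker().db  # reuse the engine created in connect_worker_db
    df = pd.read_sql_query('select * from public.tablename where index=' + indexstr,
                           engine, index_col='index')
    return np.sum(df.values)

if __name__ == '__main__':
    client = Client()              # local distributed cluster
    client.run(connect_worker_db)  # runs once on every current worker
    futures = client.map(job_with_worker_engine, [str(v) for v in range(1000)])
    print(client.gather(futures))
    client.run(disconnect_worker_db)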
Amy's answer has the benefit of being simple, but if for any reason dask starts new workers, they will not have .db.
I don't know when it was first introduced, but Dask 1.12.2 has Client.register_worker_callbacks, which takes a function as a parameter and is intended for this kind of use. If the callback takes a parameter called dask_worker, the worker itself will be passed in.
def main():
    dask_client = dask.distributed.Client(cluster)
    db = dict(
        host="db-host",
        username="user",
        # etc etc
    )

    def worker_setup(dask_worker: dask.distributed.Worker):
        dask_worker.db = db

    dask_client.register_worker_callbacks(worker_setup)
https://distributed.dask.org/en/latest/api.html#distributed.Client.register_worker_callbacks
However, this doesn't close the db connections at the end. You will probably be covered by client.run(disconnect_worker_db), but I have seen some workers not releasing their resources. Fixing this in a more comprehensive manner needs a bit more code, as per https://distributed.dask.org/en/latest/api.html#distributed.Client.register_worker_plugin
class MyWorkerPlugin(dask.distributed.WorkerPlugin):
    def __init__(self, *args, **kwargs):
        self.db = kwargs.get("db")
        assert self.db, "no db"

    def setup(self, worker: dask.distributed.Worker):
        worker.db = self.db

    def teardown(self, worker: dask.distributed.Worker):
        print(f"worker {worker.name} teardown")
        # eg db.disconnect()

def main():
    cluster = dask.distributed.LocalCluster(
        n_workers=os.cpu_count(),
        threads_per_worker=2,
    )
    dask_client = dask.distributed.Client(cluster)
    db = dict(
        host="db-host",
        username="user",
        # etc etc
    )
    dask_client.register_worker_plugin(MyWorkerPlugin, "set-dbs", db=db)
    dask_client.start()
You can give the plugin somewhat helpful names, and pass in kwargs to be used in the plugin's __init__.
Recently I came across strange behavior of SQLAlchemy regarding refreshing/populating model instances with the changes that were made outside of the current session. I created the following minimal working example and was able to reproduce the problem with it.
from time import sleep

from sqlalchemy import orm, create_engine, Column, BigInteger, Integer
from sqlalchemy.ext.declarative import declarative_base

DATABASE_URI = "postgresql://{user}:{password}@{host}:{port}/{name}".format(
    user="postgres",
    password="postgres",
    host="127.0.0.1",
    name="so_sqlalchemy",
    port="5432",
)

class SQLAlchemy:
    def __init__(self, db_url, autocommit=False, autoflush=True):
        self.engine = create_engine(db_url)
        self.session = None
        self.autocommit = autocommit
        self.autoflush = autoflush

    def connect(self):
        session_maker = orm.sessionmaker(
            bind=self.engine,
            autocommit=self.autocommit,
            autoflush=self.autoflush,
            expire_on_commit=True
        )
        self.session = orm.scoped_session(session_maker)

    def disconnect(self):
        self.session.flush()
        self.session.close()
        self.session.remove()
        self.session = None

BaseModel = declarative_base()

class TestModel(BaseModel):
    __tablename__ = "test_models"
    id = Column(BigInteger, primary_key=True, nullable=False)
    field = Column(Integer, nullable=False)

def loop(db):
    while True:
        with db.session.begin():
            t = db.session.query(TestModel).with_for_update().get(1)
            if t is None:
                print("No entry in db, creating...")
                t = TestModel(id=1, field=0)
                db.session.add(t)
                db.session.flush()
            print(f"t.field value is {t.field}")
            t.field += 1
            print(f"t.field value before flush is {t.field}")
            db.session.flush()
            print(f"t.field value after flush is {t.field}")
        print(f"t.field value after transaction is {t.field}")
        print("Sleeping for 2 seconds.")
        sleep(2.0)

def main():
    db = SQLAlchemy(DATABASE_URI, autocommit=True, autoflush=True)
    db.connect()
    try:
        loop(db)
    except KeyboardInterrupt:
        print("Canceled")

if __name__ == '__main__':
    main()
My requirements.txt file looks like this:
alembic==1.0.10
psycopg2-binary==2.8.2
sqlalchemy==1.3.3
If I run the script (I use Python 3.7.3 on my laptop running Ubuntu 16.04), it will nicely increment a value every two seconds as expected:
t.field value is 0
t.field value before flush is 1
t.field value after flush is 1
t.field value after transaction is 1
Sleeping for 2 seconds.
t.field value is 1
t.field value before flush is 2
t.field value after flush is 2
t.field value after transaction is 2
Sleeping for 2 seconds.
...
Now at some point I open the postgres database shell and begin another transaction:
so_sqlalchemy=# BEGIN;
BEGIN
so_sqlalchemy=# UPDATE test_models SET field=100 WHERE id=1;
UPDATE 1
so_sqlalchemy=# COMMIT;
COMMIT
As soon as I press Enter after the UPDATE query, the script blocks as expected, since I'm issuing a SELECT ... FOR UPDATE query there. However, when I commit the transaction in the database shell, the script continues from the previous value (say, 27) and does not detect that the external transaction has changed the value of field in the database to 100.
My question is, why does this happen at all? There are several factors that seem to contradict the current behavior:
I'm using the expire_on_commit setting set to True, which seems to imply that every model instance that has been used in a transaction will be marked as expired after the transaction has been committed. (Quoting the documentation: "When True, all instances will be fully expired after each commit(), so that all attribute/object access subsequent to a completed transaction will load from the most recent database state.")
I'm not accessing some old model instance but rather issuing a completely new query every time. As far as I understand, this should lead to a direct query to the database rather than access to a cached instance. I can confirm that this is indeed the case if I turn the sqlalchemy debug log on.
The quick and dirty fix for this problem is to call db.session.expire_all() right after the transaction has begun, but this seems very inelegant and counter-intuitive. I would be very glad to understand what's wrong with the way I'm working with sqlalchemy here.
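For reference, the quick-and-dirty workaround mentioned above would sit at the top of the transaction in loop(), roughly like this:
def loop(db):
    while True:
        with db.session.begin():
            # Workaround: drop state cached from the previous iteration so the
            # locked row is re-read from the database.
            db.session.expire_all()
            t = db.session.query(TestModel).with_for_update().get(1)
            # ... rest of the loop body unchanged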
I ran into a very similar situation with MySQL. I needed to "see" changes to the table that were coming from external sources in the middle of my code's database operations. I ended up having to set autocommit=True in my session call and use the begin() / commit() methods of the session to "see" data that was updated externally.
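A minimal sketch of that pattern, adapted to the question's SQLAlchemy wrapper class (this only illustrates the workaround described above, not a general recommendation):
# Session configured with autocommit=True (legacy mode).
db = SQLAlchemy(DATABASE_URI, autocommit=True, autoflush=True)
db.connect()

# Frame each unit of work explicitly; the next begin() starts a fresh
# transaction and re-reads the current database state.
db.session.begin()
t = db.session.query(TestModel).with_for_update().get(1)
t.field += 1
db.session.commit()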
The SQLAlchemy docs say this is a legacy configuration:
Warning
“autocommit” mode is a legacy mode of use and should not be considered for new projects.
but also say in the next paragraph:
Modern usage of “autocommit mode” tends to be for framework integrations that wish to control specifically when the “begin” state occurs
So it doesn't seem to be clear which statement is correct.
I'm having a hard time figuring out how to develop phase 3 of this algorithm:
Fetch data from a series of APIs
Store the data in the script until a certain condition is reached (cache and don't disturb the DB)
Push that structured data to a database AND at the same time continue with step 1 (launch step 1 without waiting for the upload to the DB to complete; the two things should run in parallel)
import requests
import time
from sqlalchemy import schema, types
from sqlalchemy.engine import create_engine
import threading

# I usually work on postgres
meta = schema.MetaData(schema="example")

# table one
table_api_one = schema.Table('api_one', meta,
    schema.Column('id', types.Integer, primary_key=True),
    schema.Column('field_one', types.Unicode(255), default=u''),
    schema.Column('field_two', types.BigInteger()),
)

# table two
table_api_two = schema.Table('api_two', meta,
    schema.Column('id', types.Integer, primary_key=True),
    schema.Column('field_one', types.Unicode(255), default=u''),
    schema.Column('field_two', types.BigInteger()),
)

# create tables
engine = create_engine("postgres://......", echo=False, pool_size=15, max_overflow=15)
meta.bind = engine
meta.create_all(checkfirst=True)

# get the data from the API and return data as JSON
def getdatafrom(url):
    data = requests.get(url)
    structured = data.json()
    return structured

# push the data to the DB
def flush(list_one, list_two):
    connection = engine.connect()
    # both lists are lists of json
    connection.execute(table_api_one.insert(), list_one)
    connection.execute(table_api_two.insert(), list_two)
    connection.close()

# start doing something
def main():
    timesleep = 30
    flush_limit = 10
    threading.Timer(timesleep * flush_limit, main).start()
    data_api_one = []
    data_api_two = []
    # repeat the process 10 times (flush_limit), avoiding keeping the DB too busy
    while len(data_api_one) < flush_limit and len(data_api_two) < flush_limit:
        data_api_one.append(getdatafrom("http://www.apiurlone.com/api...").copy())
        data_api_two.append(getdatafrom("http://www.apiurltwo.com/api...").copy())
        time.sleep(timesleep)
    # push the data when the limit is reached
    flush(data_api_one, data_api_two)

# start the example
main()
In this example script, a main() thread is launched every 10 * 30 seconds (to avoid overlapping threads),
but with this approach, during the flush() the script stops collecting data from the APIs.
How is it possible to flush and keep getting data from the APIs continuously?
Thanks!
The usual approach is a Queue object (from the module named Queue or queue, depending on the Python version).
Create a producer function (running in one thread) that collects the API data and, when it is time to flush, puts it in the queue, and a consumer function (running in another thread) that waits to get the data from the queue and stores it in the database.
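A minimal sketch of that producer/consumer split, reusing the getdatafrom() and flush() helpers from the question (the URLs and batch size are placeholders):
import queue
import threading
import time

batch_queue = queue.Queue()

def producer(flush_limit=10, timesleep=30):
    while True:
        batch_one, batch_two = [], []
        while len(batch_one) < flush_limit:
            batch_one.append(getdatafrom("http://www.apiurlone.com/api..."))
            batch_two.append(getdatafrom("http://www.apiurltwo.com/api..."))
            time.sleep(timesleep)
        # Hand the full batch to the consumer and immediately keep collecting.
        batch_queue.put((batch_one, batch_two))

def consumer():
    while True:
        list_one, list_two = batch_queue.get()  # blocks until a batch arrives
        flush(list_one, list_two)
        batch_queue.task_done()

threading.Thread(target=producer, daemon=True).start()
threading.Thread(target=consumer, daemon=True).start()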