Execute multiple independent statements in SQLAlchemy Core? - python

I'm using SQLAlchemy Core to run a few independent statements. The statements target separate tables and are unrelated, so I can't use the standard table.insert() with multiple dictionaries of params passed in. Right now, I'm doing this:
sql_conn.execute(query1)
sql_conn.execute(query2)
Is there any way I can run these in one shot instead of needing two back-and-forths to the db? I'm on MySQL 5.7 and Python 2.7.11.

Sounds like you want a Transaction:
with engine.connect() as sql_conn:
    with sql_conn.begin():
        sql_conn.execute(query1)
        sql_conn.execute(query2)
The begin() context manager commits implicitly when the block exits (and rolls back on an exception), so both statements are committed as a single transaction. If you want to do it explicitly, it's done like this:
from sqlalchemy import create_engine, text
engine = create_engine("postgresql://scott:tiger@localhost/test")
connection = engine.connect()
trans = connection.begin()
connection.execute(text("insert into x (a, b) values (1, 2)"))
trans.commit()
https://docs.sqlalchemy.org/en/14/core/connections.html#basic-usage
While this is mostly geared towards creating real database transactions, it has a useful side effect for your use case: SQLAlchemy maintains a "virtual" transaction on its side. See this link for more info:
https://docs.sqlalchemy.org/en/14/orm/session_transaction.html#session-level-vs-engine-level-transaction-control
The Session tracks the state of a single “virtual” transaction at a time, using an object called SessionTransaction. This object then makes use of the underlying Engine or engines to which the Session object is bound in order to start real connection-level transactions using the Connection object as needed.
This “virtual” transaction is created automatically when needed, or can alternatively be started using the Session.begin() method. To as great a degree as possible, Python context manager use is supported both at the level of creating Session objects as well as to maintain the scope of the SessionTransaction.
The above describes the ORM functionality but the link shows that it has parity with the Core functionality.
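To illustrate that parity, here is a minimal ORM sketch of the same two-statement pattern (SQLAlchemy 1.4 style, reusing the engine from the snippet above; the tables and values are placeholders, not from the question):
from sqlalchemy import text
from sqlalchemy.orm import Session

with Session(engine) as session:
    with session.begin():
        # both statements commit together when the begin() block exits
        session.execute(text("insert into x (a, b) values (1, 2)"))
        session.execute(text("insert into y (c) values (3)"))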

It is neither wise, nor practical, to run two queries at once. I am referring to having a single call to the server with two SQL statements one after another: "SELECT ...; SELECT ...;"
It is not wise: allowing such batching gives hackers another way to do nasty things via "SQL injection".
On the other hand, it is possible, but not necessarily practical. You would create a Stored Procedure that contains any number of related (or unrelated) queries in it, then CALL that procedure (see the sketch after this list). There are some things that may make it impractical:
The only way to get data in is via a finite number of scalar arguments.
The output comes back as multiple resultsets; you need to code differently to see what happened.
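If you do go that route, here is a rough sketch of what the CALL might look like from SQLAlchemy, dropping down to the DBAPI cursor to read the multiple result sets. The procedure name multi_step, its argument, and the connection URL are purely hypothetical:
from sqlalchemy import create_engine

engine = create_engine("mysql+pymysql://user:password@localhost/test")  # assumed DSN
raw = engine.raw_connection()            # the underlying DBAPI connection
try:
    cursor = raw.cursor()
    cursor.callproc("multi_step", [42])  # hypothetical procedure taking one scalar argument
    first_rows = cursor.fetchall()       # rows from the first result set
    while cursor.nextset():              # walk any additional result sets, one at a time
        more_rows = cursor.fetchall()
    raw.commit()
finally:
    raw.close()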
Roundtrip latency is insignificant if you are on the same machine as the MySQL server. It can usually be ignored even if the two servers are in the same datacenter. Latency becomes important when the client and server are separated by a long distance. For cross-Atlantic latency, we are talking over 100ms. Brazil to China is about 250ms. (Be glad we are not living on Jupiter.)

Related

How to use sqlite across multiple (spawned) python processes via sqlalchemy

I have a file called db.py with the following code:
from sqlalchemy import create_engine
from sqlalchemy.orm import scoped_session, sessionmaker
engine = create_engine('sqlite:///my_db.sqlite')
session = scoped_session(sessionmaker(bind=engine,autoflush=True))
I am trying to import this file in various subprocesses started using a spawn context (potentially important, since various fixes that worked for fork don't seem to work for spawn).
The import statement is something like:
from db import session
and then I use this session ad libitum without worrying about concurrency, assuming SQLite's internal locking mechanism will order transactions so as to avoid concurrency errors; I don't really care about transaction order.
This seems to result in errors like the following:
sqlite3.ProgrammingError: SQLite objects created in a thread can only be used in that same thread. The object was created in thread id 139813508335360 and this is thread id 139818279995200.
Mind you, this doesn't directly seem to affect my program, every transaction goes through just fine, but I am still worried about what's causing this.
My understanding was that scoped_session was thread-local, so I could import it however I want without issues. Furthermore, my assumption was that SQLAlchemy will always handle the closing of connections and that SQLite will handle ordering (i.e. make a session wait for another session to end before it can do any transaction).
Obviously one of these assumptions is wrong, or I am misunderstanding something basic about the mechanism here, but I can't quite figure out what. Any suggestions would be useful.
The problem isn't about thread-local sessions, it's that the original connection object is in a different thread to those sessions. SQLite disables using a connection across different threads by default.
The simplest answer to your question is to turn off sqlite's same thread checking. In SQLAlchemy you can achieve this by specifying it as part of your database URL:
engine = create_engine('sqlite:///my_db.sqlite?check_same_thread=False')
I'm guessing that will do away with the errors, at least.
Depending on what you're doing, this may still be dangerous - if you're ensuring your transactions are serialised (that is, one after the other, never overlapping or simultaneous) then you're probably fine. If you can't guarantee that then you're risking data corruption, in which case you should consider a) using a database backend that can handle concurrent writes, or b) creating an intermediary app or service that solely manages sqlite reads and writes and that your other apps can communicate with. That latter option sounds fun but be warned you may end up reinventing the wheel when you're better off just spinning up a Postgres container or something.
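For reference, the same flag can also be passed through connect_args rather than the URL query string; a sketch with the same file name:
from sqlalchemy import create_engine

# Equivalent to the ?check_same_thread=False query parameter: the flag is
# forwarded to sqlite3.connect() when each pooled connection is created.
engine = create_engine(
    'sqlite:///my_db.sqlite',
    connect_args={'check_same_thread': False},
)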

Is it thread-safe to use SQLAlchemy with engine/connections instead of sessions?

I've got a simple webservice which uses SQLAlchemy to connect to a database using the pattern
engine = create_engine(database_uri)
connection = engine.connect()
In each endpoint of the service, I then use the same connection, in the following fashion:
for result in connection.execute(query):
    <do something fancy>
Since Sessions are not thread-safe, I'm afraid that connections aren't either.
Can I safely keep doing this? If not, what's the easiest way to fix it?
Minor note -- I don't know if the service will ever run multithreaded, but I'd rather be sure that I don't get into trouble when it does.
Short answer: you should be fine.
There is a difference between a connection and a Session. The short description is that connections represent just that… a connection to a database. Information you pass into it will come out pretty plain. It won't keep track of your transactions unless you tell it to, and it won't care about what order you send it data. So if it matters that you create your Widget object before you create your Sprocket object, then you'd better call that in a thread-safe context. The same generally goes for if you want to keep track of a database transaction.
Session, on the other hand, keeps track of data and transactions for you. If you check out the source code, you'll notice quite a bit of back and forth over database transactions, and without a way to know that you have everything you want in a transaction, you could very well end up committing in one thread while you expect to be able to add another object (or several) in another.
In case you don't know what a transaction is, this is Wikipedia, but the short version is that transactions help make sure your data stays stable. If you have 15 inserts and updates, and insert 15 fails, you might not want to make the other 14. A transaction would let you cancel the entire operation in bulk.
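If the service does end up multithreaded, here is a minimal sketch of the Session-per-thread pattern using scoped_session (the URL and SQL are placeholders, not the asker's code):
from sqlalchemy import create_engine, text
from sqlalchemy.orm import scoped_session, sessionmaker

engine = create_engine("sqlite:///example.sqlite")
Session = scoped_session(sessionmaker(bind=engine))

def handle_request():
    session = Session()          # thread-local session
    try:
        session.execute(text("insert into x (a, b) values (1, 2)"))
        session.commit()         # commit this thread's transaction
    except Exception:
        session.rollback()       # undo everything in this transaction
        raise
    finally:
        Session.remove()         # discard the thread-local session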

How to diagnose extra SQLAlchemy connections in Pyramid

When my app runs, I'm very frequently getting issues around the connection pooling (one is "QueuePool limit of size 5 overflow 10 reached", another is "FATAL: remaining connection slots are reserved for non-replication superuser connections").
I have a feeling that it's due to some code not closing connections properly, or other code greedily trying to open new ones when it shouldn't, but I'm using the default SQLAlchemy settings, so I assume the connection pool defaults shouldn't be unreasonable. We are using the scoped_session(sessionmaker()) way of creating the session so multiple threads are supported.
So my main question is if there is a tool or way to find out where the connections are going? Short of being able to see as soon as a new one is created (that is not supposed to be created), are there any obvious anti-patterns that might result in this effect?
Pyramid is very un-opinionated and with DB connections, there seem to be two main approaches (equally supported by Pyramid it would seem). In our case, the code base when I started the job used one approach (I'll call it the "globals" approach) and we've agreed to switch to another approach that relies less on globals and more on Pythonic idioms.
About our architecture: the application comprises one repo which houses the Pyramid project and then sources a number of other git modules, each of which had their own connection setup. The "globals" way connects to the database in a very non-ORM fashion, eg.:
(in each repo's __init__ file)
def load_database():
    global tables
    tables['table_name'] = Table(
        'table_name', metadata,
        Column('column_name', String),
    )
There are related globals that are frequently peppered all over the code:
def function_needing_data(field_value):
    global db, tables
    select = sqlalchemy.sql.select(
        [tables['table_name'].c.data], tables['table_name'].c.name == field_value)
    return db.execute(select)
This tables variable is latched onto within each git repo, each of which adds some more table definitions, and somehow the global tables manages to work, providing access to all of the tables.
The approach that we've moved to (although at this time, there are parts of both approaches still in the code) is via a centralised connection, binding all of the metadata to it and then querying the db in an ORM approach:
(model)
class ModelName(MetaDataBase):
    __tablename__ = "models_table_name"
    ... (field values)
(function requiring data)
from models.db import DBSession
from models.model_name import ModelName
def function_needing_data(field_value):
    return DBSession.query(ModelName).filter(
        ModelName.field_value == field_value).all()
We've largely moved the code over to the latter approach which feels right, but perhaps I'm mistaken in my intentions. I don't know if there is anything inherently good or bad in either approach but could this (one of the approaches) be part of the problem so we keep running out of connections? Is there a telltale sign that I should look out for?
It appears that Pyramid functions best (in terms of handling the connection pool) when you use the Pyramid transaction manager (pyramid_tm). This excellent article by Jon Rosebaugh provides some helpful insight into both how Pyramid apps typically set up their database connections and how they should set them up.
In my case, it was necessary to include the pyramid_tm package and then remove a few occurrences where we were manually committing session changes since pyramid_tm will automatically commit changes if it doesn't see a reason not to.
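For context, here is a sketch of what that setup roughly looks like in the classic Pyramid/SQLAlchemy scaffold; the module layout and the zope.sqlalchemy hookup are assumptions about your project, not something from the question:
from pyramid.config import Configurator
from sqlalchemy.orm import scoped_session, sessionmaker
from zope.sqlalchemy import register

DBSession = scoped_session(sessionmaker())
register(DBSession)   # ties the session's commit/rollback to the transaction manager

def main(global_config, **settings):
    config = Configurator(settings=settings)
    config.include('pyramid_tm')   # commit or abort the transaction per request
    config.scan()
    return config.make_wsgi_app()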
[Update]
I continued to have connection pooling issues, although far fewer of them. After a lot of debugging, I found that the Pyramid transaction manager (if you're using it correctly) should not be the issue at all. The remaining connection pooling issues I had were caused by scripts that ran via cron jobs. A script will release its connections when it's finished, but bad code design may result in situations where the same script is started again while the previous run is still going (causing them both to run slower, slow enough to have both running while a third instance of the script starts, and so on).
This is a more language- and database-agnostic error since it stems from poor job-scripting design but it's worth keeping in mind. In my case, the script had an "&" at the end so that each instance started as a background process, waited 10 seconds, then spawned another, rather than making sure the first job started AND completed, then waited 10 seconds, then started another.
Hope this helps when debugging this very frustrating and thorny issue.

Writing in SQLite multiple Threads in Python

I've got a sqlite3 database and I want to write in it from multiple threads. I've got multiple ideas but I'm not sure which I should implement.
create multiple connections, and detect and wait if the DB is locked
use one connection and try to make use of Serialized connections (which don't seem to be implemented in Python)
have a background process with a single connection, which collects the queries from all threads and then executes them on their behalf
forget about SQLite and use something like PostgreSQL
What are the advantages of these different approaches, and which is most likely to be fruitful? Are there any other possibilities?
Try to use https://pypi.python.org/pypi/sqlitedict
A lightweight wrapper around Python's sqlite3 database, with a dict-like interface and multi-thread access support.
But take into account "Concurrent requests are still serialized internally, so this "multithreaded support" doesn't give you any performance benefits. It is a work-around for sqlite limitations in Python."
PostgreSQL, MySQL, etc. give you better performance for several simultaneous connections.
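A minimal usage sketch (the file name and key are placeholders; values are pickled for you):
from sqlitedict import SqliteDict

# autocommit=True persists each assignment immediately; the dict can be
# shared across threads because writes are serialized internally.
db = SqliteDict('./my_db.sqlite', autocommit=True)
db['some_key'] = {'any': 'picklable value'}
print(db['some_key'])
db.close()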
I used method 1 before. It is the easiest to code. Since that project has a small website, each query takes only a few milliseconds, so all the user requests can be processed promptly.
I also used method 3 before, because when queries take longer it is better to queue them: frequent "detect and wait" polling makes no sense there. It requires a classic producer-consumer model, which takes more time to code.
But if the queries are really heavy and frequent, I suggest looking at another database like MS SQL or MySQL.
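For method 3, here is a rough sketch of that producer-consumer idea, with a single writer thread owning the connection (the table and SQL are only illustrative):
import queue
import sqlite3
import threading

write_queue = queue.Queue()
STOP = object()   # sentinel used to shut the writer down

def writer():
    # The only thread that ever touches the SQLite connection.
    conn = sqlite3.connect('my_db.sqlite')
    while True:
        item = write_queue.get()
        if item is STOP:
            break
        sql, params = item
        conn.execute(sql, params)
        conn.commit()
    conn.close()

threading.Thread(target=writer, daemon=True).start()

# Any other thread enqueues its statement instead of touching SQLite directly:
write_queue.put(("INSERT INTO results (value) VALUES (?)", (42,)))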

SQLite3 and Multiprocessing

I noticed that sqlite3 isn't really capable nor reliable when I use it inside a multiprocessing environment. Each process tries to write some data into the same database, so that a connection is used by multiple threads. I tried it with the check_same_thread=False option, but the number of insertions is pretty random: sometimes it includes everything, sometimes not. Should I parallel-process only parts of the function (fetching data from the web), stack their outputs into a list, and put them into the table all together, or is there a reliable way to handle multiple connections with sqlite?
First of all, there's a difference between multiprocessing (multiple processes) and multithreading (multiple threads within one process).
It seems that you're talking about multithreading here. There are a couple of caveats that you should be aware of when using SQLite in a multithreaded environment. The SQLite documentation mentions the following:
Do not use the same database connection at the same time in more than one thread.
On some operating systems, a database connection should always be used in the same thread in which it was originally created.
See here for a more detailed information: Is SQLite thread-safe?
I've actually just been working on something very similar:
multiple processes (for me a processing pool of 4 to 32 workers)
each process worker does some stuff that includes getting information from the web (a call to the Alchemy API for mine)
each process opens its own sqlite3 connection, all to a single file, and each process adds one entry before getting the next task off the stack
At first I thought I was seeing the same issue as you, then I traced it to overlapping and conflicting issues with retrieving the information from the web. Since I was right there I did some torture testing on sqlite and multiprocessing and found I could run MANY process workers, all connecting and adding to the same sqlite file without coordination and it was rock solid when I was just putting in test data.
So now I'm looking at your phrase "(fetching data from the web)" - perhaps you could try replacing that data fetching with some dummy data to confirm that it is really the sqlite3 connection causing your problems. At least in my tested case (running right now in another window), I found that multiple processes were all able to add through their own connections without issues. Your description exactly matches the problem I was having when two processes stepped on each other while going for the web API (a very odd error, actually) and sometimes didn't get the expected data, which of course leaves an empty slot in the database. My eventual solution was to detect this failure within each worker and retry the web API call when it happened (it could have been more elegant, but this was a personal hack).
My apologies if this doesn't apply to your case, without code it's hard to know what you're facing, but the description makes me wonder if you might widen your considerations.
sqlitedict: A lightweight wrapper around Python's sqlite3 database, with a dict-like interface and multi-thread access support.
If I had to build a system like the one you describe, using SQLite, then I would start by writing an async server (using the asynchat module) to handle all of the SQLite database access, and then I would write the other processes to use that server. When there is only one process accessing the db file directly, it can enforce a strict sequence of queries so that there is no danger of two processes stepping on each other's toes. It is also faster than continually opening and closing the db.
In fact, I would also try to avoid maintaining sessions; in other words, I would try to write all the other processes so that every database transaction is independent. At minimum this would mean allowing a transaction to contain a list of SQL statements, not just one, and it might even require some if/then capability so that you could SELECT a record, check that a field is equal to X, and only then UPDATE that field. If your existing app is closing the database after every transaction, then you don't need to worry about sessions.
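As a hedged illustration of that idea (the plumbing here uses multiprocessing rather than asynchat, and the table name is made up): a dedicated writer process accepts a whole list of statements and runs it as one independent transaction:
import sqlite3
from multiprocessing import Process, Queue

def db_server(work_queue):
    # The only process that touches the SQLite file; each work item is a
    # list of (sql, params) pairs executed as a single transaction.
    conn = sqlite3.connect('app.sqlite')
    while True:
        statements = work_queue.get()
        if statements is None:              # shutdown signal
            break
        try:
            for sql, params in statements:
                conn.execute(sql, params)
            conn.commit()
        except Exception:
            conn.rollback()                 # undo the whole batch on failure
    conn.close()

if __name__ == '__main__':
    q = Queue()
    Process(target=db_server, args=(q,)).start()
    # A worker submits a self-contained transaction as one message:
    q.put([("INSERT INTO jobs (name) VALUES (?)", ("demo",))])
    q.put(None)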
You might be able to use something like nosqlite http://code.google.com/p/nosqlite/
