SQLite supports a "shared cache" for :memory: databases when they are opened with a special URI (according to sqlite.org):
[T]he same in-memory database can be opened by two or more database
connections as follows:
rc = sqlite3_open("file::memory:?cache=shared",&db);
I can take advantage of this in Python 3.4 by using the URI parameter for sqlite3.connect():
sqlite3.connect('file::memory:?cache=shared', uri=True)
However, I can't seem to get the same thing working for SQLAlchemy:
engine = sqlalchemy.create_engine('sqlite:///:memory:?cache=shared')
engine.connect()
...
TypeError: 'cache' is an invalid keyword argument for this function
Is there some way to get SQLAlchemy to make use of the shared cache?
Edit:
On Python 3.4, I can use the creator argument to create_engine to solve the problem, but the problem remains on other Python versions:
creator = lambda: sqlite3.connect('file::memory:?cache=shared', uri=True)
engine = sqlalchemy.create_engine('sqlite://', creator=creator)
engine.connect()
On older Python versions you should simply avoid passing uri=True, and the same creator approach still works:
import sqlite3
import sys

import sqlalchemy

DB_URI = 'file::memory:?cache=shared'
PY2 = sys.version_info.major == 2

if PY2:
    params = {}
else:
    params = {'uri': True}

creator = lambda: sqlite3.connect(DB_URI, **params)
engine = sqlalchemy.create_engine('sqlite:///:memory:', creator=creator)
engine.connect()
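As a quick sanity check, here is a sketch (not part of the original answer; the NullPool and the extra keep-alive connection are my own additions, and it assumes Python 3.4+) showing that two independent pysqlite connections opened through the creator really do see the same in-memory database:

import sqlite3

import sqlalchemy
from sqlalchemy.pool import NullPool

DB_URI = 'file::memory:?cache=shared'
creator = lambda: sqlite3.connect(DB_URI, uri=True)

# NullPool forces a brand-new sqlite3 connection on every checkout, so the
# check below exercises the shared cache rather than plain connection reuse.
engine = sqlalchemy.create_engine('sqlite://', creator=creator, poolclass=NullPool)

# Keep one raw connection open: a shared-cache :memory: database is destroyed
# once its last connection closes.
keepalive = sqlite3.connect(DB_URI, uri=True)

with engine.begin() as conn:
    conn.execute(sqlalchemy.text('CREATE TABLE t (x INTEGER)'))
    conn.execute(sqlalchemy.text('INSERT INTO t VALUES (1)'))

with engine.connect() as conn:
    print(conn.execute(sqlalchemy.text('SELECT x FROM t')).fetchall())  # [(1,)]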
The SQLAlchemy docs for the SQLite dialect describe the problem and a solution in detail:
Threading/Pooling Behavior
Pysqlite’s default behavior is to prohibit
the usage of a single connection in more than one thread. This is
originally intended to work with older versions of SQLite that did not
support multithreaded operation under various circumstances. In
particular, older SQLite versions did not allow a :memory: database to
be used in multiple threads under any circumstances.
Pysqlite does include a now-undocumented flag known as
check_same_thread which will disable this check, however note that
pysqlite connections are still not safe to use concurrently in
multiple threads. In particular, any statement execution calls would
need to be externally mutexed, as Pysqlite does not provide for
thread-safe propagation of error messages among other things. So while
even :memory: databases can be shared among threads in modern SQLite,
Pysqlite doesn’t provide enough thread-safety to make this usage worth
it.
SQLAlchemy sets up pooling to work with Pysqlite’s default behavior:
When a :memory: SQLite database is specified, the dialect by default
will use SingletonThreadPool. This pool maintains a single connection
per thread, so that all access to the engine within the current thread
use the same :memory: database - other threads would access a
different :memory: database.
When a file-based database is specified, the dialect will use NullPool
as the source of connections. This pool closes and discards
connections which are returned to the pool immediately. SQLite
file-based connections have extremely low overhead, so pooling is not
necessary. The scheme also prevents a connection from being used again
in a different thread and works best with SQLite’s coarse-grained file
locking.
Using a Memory Database in Multiple Threads
To use a :memory: database
in a multithreaded scenario, the same connection object must be shared
among threads, since the database exists only within the scope of that
connection. The StaticPool implementation will maintain a single
connection globally, and the check_same_thread flag can be passed to
Pysqlite as False:
from sqlalchemy.pool import StaticPool

engine = create_engine('sqlite://',
                       connect_args={'check_same_thread': False},
                       poolclass=StaticPool)
Note that using a :memory: database in multiple threads requires a recent version of SQLite.
Source: https://docs.sqlalchemy.org/en/13/dialects/sqlite.html#threading-pooling-behavior
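To make the quoted recommendation concrete, here is a minimal sketch (the counters table, the bump() function, and the explicit lock are my own illustration) of several threads sharing one :memory: database through StaticPool; the lock reflects the docs' warning that statement execution still needs to be externally mutexed:

import threading

import sqlalchemy
from sqlalchemy.pool import StaticPool

engine = sqlalchemy.create_engine('sqlite://',
                                  connect_args={'check_same_thread': False},
                                  poolclass=StaticPool)

with engine.begin() as conn:
    conn.execute(sqlalchemy.text('CREATE TABLE counters (n INTEGER)'))
    conn.execute(sqlalchemy.text('INSERT INTO counters VALUES (0)'))

lock = threading.Lock()  # pysqlite is not thread-safe; serialize statement execution

def bump():
    with lock, engine.begin() as conn:
        conn.execute(sqlalchemy.text('UPDATE counters SET n = n + 1'))

threads = [threading.Thread(target=bump) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

with engine.connect() as conn:
    print(conn.execute(sqlalchemy.text('SELECT n FROM counters')).scalar())  # 5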
Related
I am curious about the proper way to close a connection when reading a query through pandas with read_sql_query. I've been using:
from sqlalchemy import create_engine
import pandas as pd
engine = create_engine('credentials')
data = pd.read_sql_query(sql_query, engine)
though it seems like the traditional usage is this:
engine = create_engine('credentials')
connection = engine.connect()
result = connection.execute(users_table.select())
for row in result:
    # ....
connection.close()
If I am not creating "connection" with engine.connect() as in the second approach, how do I close my connection? Or, is it closed after pd.read_sql_query is finished?
From http://docs.sqlalchemy.org/en/latest/core/connections.html
The Engine is intended to normally be a permanent fixture established up-front and maintained throughout the lifespan of an application. It is not intended to be created and disposed on a per-connection basis; it is instead a registry that maintains both a pool of connections as well as configurational information about the database and DBAPI in use, as well as some degree of internal caching of per-database resources.
The Engine object lazily allocates Connections on demand from an internal pool. These connections aren't necessarily closed when you call the close method of the individual Connection objects, just returned to that pool, or "checked in".
If you explicitly need the connections to be closed, you should check in all connections and then call engine.dispose(), or you may need to change the Pooling strategy your Engine object uses, see http://docs.sqlalchemy.org/en/latest/core/pooling.html#pool-switching.
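If you want cleanup to be explicit rather than left to the pool, a sketch along these lines (the connection string and query are placeholders) hands read_sql_query a checked-out connection and disposes of the engine when the application is done with it:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///example.db')  # placeholder connection string

# Passing an explicit connection makes its lifetime obvious; the context
# manager returns it to the pool on exit.
with engine.connect() as connection:
    data = pd.read_sql_query('SELECT 1 AS x', connection)

# At application shutdown (or in test teardown), close everything in the pool.
engine.dispose()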
In production environments where Django runs under Apache or with multiple Gunicorn workers, there is a risk of concurrency issues.
As such, I was pretty surprised to find that Django's ORM doesn't explicitly support table/row locking. It supports transactions very handily, but that only solves half of the concurrency problem.
With a MySQL backend, what is the correct way to perform locking in Django? Or is there something else at play in Django's framework that makes them unnecessary?
Django does not explicitly provide an API to perform table locking. In my experience, well-designed code rarely needs to lock a whole table, and most concurrency issues can be solved with row-level locking. Locking a whole table is a last-ditch effort: it doesn't solve concurrency, it simply kills any attempt at concurrency.
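For the row-level case, Django's own select_for_update() issues SELECT ... FOR UPDATE on InnoDB and is usually enough; here is a minimal sketch (the Account model and transfer() function are made up for illustration):

from django.db import transaction

from myapp.models import Account  # hypothetical model


def transfer(account_id, amount):
    # select_for_update() must run inside a transaction; the selected row
    # stays locked until the atomic block commits or rolls back.
    with transaction.atomic():
        account = Account.objects.select_for_update().get(pk=account_id)
        account.balance += amount
        account.save()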
If you really need table-level locking, you can use a cursor and execute raw SQL statements:
from django.db import connection

with connection.cursor() as cursor:
    # A table name can't be sent as a query parameter, so interpolate a
    # trusted (never user-supplied) name into the statement yourself.
    cursor.execute("LOCK TABLES %s READ" % tablename)
    try:
        ...
    finally:
        cursor.execute("UNLOCK TABLES;")
Consider setting the transaction isolation level to serializable and using a MySQL table type that supports transactions (MyISAM does not; InnoDB does).
After ensuring you have a backend that supports transactions, you'd need to disable autocommit (https://docs.djangoproject.com/en/1.8/topics/db/transactions/#autocommit-details) and then make sure your code issues the appropriate commit or rollback at the end of what you consider to be a transaction.
There are one or two examples in the docs referenced above.
Doing this requires a bit more work and consideration, but provides you with transactions.
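A sketch of what that manual commit/rollback pattern can look like with Django's low-level transaction API (do_some_work() is a placeholder for your ORM calls):

from django.db import transaction

transaction.set_autocommit(False)
try:
    do_some_work()          # placeholder for the ORM calls in your "transaction"
    transaction.commit()
except Exception:
    transaction.rollback()
    raise
finally:
    transaction.set_autocommit(True)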
I want to build a database API in Python using SQLAlchemy (or any other database connector, if SQLAlchemy turns out to be the wrong tool for this kind of task). The setup is a MySQL server running on Linux or BSD, and Python software running on a Linux or BSD machine (either remote or local).
Basically, I want to spawn a new thread for each connection; the protocol is custom and quite simple. For each request I would like to open a new transaction (or session, as I have read) and then commit it. The problem I am facing is that there is a high probability that several sessions will be active at the same time, coming from different connections.
My question here is what should I do to handle this situation?
Should I use a lock so only a single session can run at the same time?
Are sessions actually thread-safe, and am I wrong in thinking they are not?
Is there a better way to handle this situation?
Is threading simply the wrong approach here?
Session objects are not thread-safe, but they are intended to be used in a thread-local fashion. From the docs:
"The Session object is entirely designed to be used in a non-concurrent fashion, which in terms of multithreading means "only in one thread at a time" .. some process needs to be in place such that mutltiple calls across many threads don’t actually get a handle to the same session. We call this notion thread local storage."
If you don't want to do the work of managing threads and sessions yourself, SQLAlchemy has the ScopedSession object to take care of this for you:
The ScopedSession object by default uses threading.local() as storage, so that a single Session is maintained for all who call upon the ScopedSession registry, but only within the scope of a single thread. Callers who call upon the registry in a different thread get a Session instance that is local to that other thread.
Using this technique, the ScopedSession provides a quick and relatively simple way of providing a single, global object in an application that is safe to be called upon from multiple threads.
See the examples in Contextual/Thread-local Sessions for setting up your own thread-safe sessions:
# set up a scoped_session
from sqlalchemy.orm import scoped_session
from sqlalchemy.orm import sessionmaker
session_factory = sessionmaker(bind=some_engine)
Session = scoped_session(session_factory)
# now all calls to Session() will create a thread-local session
some_session = Session()
# you can now use some_session to run multiple queries, etc.
# remember to close it when you're finished!
Session.remove()
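Building on the Session registry defined above, here is a sketch of per-connection worker threads (handle_request() and its payload are made up for illustration; each thread gets its own thread-local session, so no extra locking is needed around session usage itself):

import threading

def handle_request(payload):
    session = Session()           # thread-local session from the registry above
    try:
        # ... run queries / add objects for this request ...
        session.commit()
    except Exception:
        session.rollback()
        raise
    finally:
        Session.remove()          # discard this thread's session

threads = [threading.Thread(target=handle_request, args=(n,)) for n in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()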
I'm writing my first SQLAlchemy (0.6.8)/Python (2.7.1) program, sitting on top of SQLite (3.7.6.3, I think), running on Windows Vista.
In order to perform unit-testing, I am pointing SQLite to a test database, and my unit-test scripts routinely delete the database file, so I am continuously working with a known initial state.
Sometimes my (single-threaded) unit-tests fail to remove the file:
WindowsError: [Error 32] The process cannot access the file because it is being used by another process
The only process that uses the file is the unit-test harness. Clearly, some lock is not being released by one of my completed unit-tests, preventing the next unit-test in the same process from deleting the file.
I have searched all the places I have created a session and confirmed there is a corresponding session.commit() or session.rollback().
I have searched for all session.commit() and session.rollback() calls in my code, and added a session.close() call immediately afterwards, in an attempt to explicitly release any transactional locks, but it hasn't helped.
Are there any secrets to ensuring the remaining locks are removed at the end of a transaction to permit the file to be deleted?
Someone had a similar problem: http://www.mail-archive.com/sqlalchemy@googlegroups.com/msg20724.html
You should use a NullPool when establishing the connection, to ensure that no active connection remains after session.close():
from sqlalchemy import create_engine
from sqlalchemy.pool import NullPool
to_engine = create_engine('sqlite:///%s' % temp_file_name, poolclass=NullPool)
Reference: http://www.sqlalchemy.org/docs/06/core/pooling.html?highlight=pool#sqlalchemy.pool
This is only required in SQLAlchemy prior to 0.7.0. After 0.7.0, this became the default behaviour for SQLite. Reference: http://www.sqlalchemy.org/docs/core/pooling.html?highlight=pool#sqlalchemy.pool
Do you require shared access to the database during unit tests? If not, use an in-memory SQLite database for those tests. From the SQLAlchemy documentation:
The sqlite :memory: identifier is the default if no filepath is present. Specify sqlite:// and nothing else:
# in-memory database
e = create_engine('sqlite://')
No need to manage temporary files, no locking semantics, a guaranteed clean slate between unit tests, etc.
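For example, a unittest-style test case along these lines (a sketch assuming a reasonably modern SQLAlchemy; the table and test names are made up) gets a private, disposable database per test:

import unittest

from sqlalchemy import create_engine, text


class DatabaseTest(unittest.TestCase):
    def setUp(self):
        # a fresh, private in-memory database per test: no files to delete,
        # no file locks to worry about
        self.engine = create_engine('sqlite://')
        with self.engine.begin() as conn:
            conn.execute(text('CREATE TABLE items (name TEXT)'))

    def tearDown(self):
        self.engine.dispose()   # drops the in-memory database entirely

    def test_insert(self):
        with self.engine.begin() as conn:
            conn.execute(text("INSERT INTO items VALUES ('widget')"))
            count = conn.execute(text('SELECT count(*) FROM items')).scalar()
        self.assertEqual(count, 1)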
I use MySQL with MySQLdb module in Python, in Django.
I'm running in autocommit mode in this case (and Django's transaction.is_managed() actually returns False).
I have several processes interacting with the database.
One process fetches all Task models with Task.objects.all()
Then another process adds a Task model (I can see it in a database management application).
If I call Task.objects.all() on the first process, I don't see anything. But if I call connection._commit() and then Task.objects.all(), I see the new Task.
My question is: is there any caching involved at the connection level? And is this normal behaviour? (It does not seem normal to me.)
This certainly seems related to autocommit/table locking.
If MySQLdb implements the DB-API 2 spec, it will probably run each connection as one single continuous transaction. When you say 'running in autocommit mode', do you mean MySQL itself, the MySQLdb module, or Django?
Not committing intermittently perfectly explains the behaviour you are getting:
i) a connection implemented as one single transaction in MySQLdb (probably the default)
ii) not opening/closing connections only when needed, but (re)using one or more persistent database connections (my guess; this could be inherited from Django's architecture)
iii) your selects ('reads') take a simple read lock on the table, which means other connections can still read it, but connections that want to write can't do so (immediately), because the read lock prevents them from getting the exclusive lock needed for writing. The write is thus postponed until a (short) exclusive lock on the table can be obtained - that is, when you close the connection or commit manually.
I'd do the following in your case:
find out which table locks are on your database during the scenario above
read about Django and transactions here. A quick skim suggests that using standard Django functionality implicitly causes commits, which means hand-crafted SQL (insert, update, ...) may not.
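As a hedged sketch of point i) above (Task is the model from the question, but the app path is assumed, and this presumes InnoDB's default REPEATABLE READ isolation), ending the long-lived read transaction before re-querying gives the next query a fresh snapshot:

from django.db import connection

from myapp.models import Task  # model from the question; app path assumed

# Under REPEATABLE READ, the first SELECT in a transaction pins a snapshot,
# so later SELECTs on the same open transaction miss rows committed by
# other processes in the meantime.
stale = list(Task.objects.all())

# End the transaction: closing the connection (Django reopens it lazily on
# the next query) or committing, as the question does with connection._commit(),
# both work.
connection.close()

fresh = list(Task.objects.all())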