In production environments where Django runs under Apache or with multiple Gunicorn workers, there is a real risk of concurrency issues.
As such, I was pretty surprised to find that Django's ORM doesn't explicitly support table/row locking. It supports transactions very handily, but that only solves half of the concurrency problem.
With a MySQL backend, what is the correct way to perform locking in Django? Or is there something else at play in Django's framework that makes explicit locking unnecessary?
Django does not explicitly provide an API to perform table locking. In my experience, well-designed code rarely needs to lock a whole table, and most concurrency issues can be solved with row-level locking. Table locking is a last-ditch effort: it doesn't solve concurrency, it simply kills any attempt at concurrency.
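For the row-level case, Django's select_for_update() covers most needs. A minimal sketch (the Task model and its fields are hypothetical):

from django.db import transaction
from myapp.models import Task  # hypothetical model

with transaction.atomic():
    # Emits SELECT ... FOR UPDATE; other transactions block on these rows
    # until this block commits or rolls back.
    task = Task.objects.select_for_update().get(pk=42)
    task.status = "done"
    task.save()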
If you really need table-level locking, you can use a cursor and execute raw SQL statements:
from django.db import connection

with connection.cursor() as cursor:
    # Table names cannot be passed as query parameters; the name has to be
    # interpolated into the statement itself (only do this with a trusted name).
    cursor.execute("LOCK TABLES %s READ" % tablename)
    try:
        ...
    finally:
        cursor.execute("UNLOCK TABLES;")
Consider setting the transaction isolation level to SERIALIZABLE, and using a MySQL storage engine that supports transactions (MyISAM does not, InnoDB does).
After ensuring you have a backend that supports transactions, you'd then need to disable autocommit (https://docs.djangoproject.com/en/1.8/topics/db/transactions/#autocommit-details) and then ensure that your code issues the appropriate commit or rollback statement at the end of what you consider to be transactions.
There is an example or two in the above referenced docs.
Doing this requires a bit more work and consideration, but provides you with transactions.
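As a rough illustration of what that looks like under Django 1.8's API (the work done inside the transaction is hypothetical):

from django.db import transaction

transaction.set_autocommit(False)
try:
    do_some_updates()          # hypothetical work that should be atomic
    transaction.commit()
except Exception:
    transaction.rollback()
    raise
finally:
    transaction.set_autocommit(True)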
Related
I have a file called db.py with the following code:
from sqlalchemy import create_engine
from sqlalchemy.orm import scoped_session, sessionmaker
engine = create_engine('sqlite:///my_db.sqlite')
session = scoped_session(sessionmaker(bind=engine, autoflush=True))
I am trying to import this file in various subprocesses started using a spawn context (potentially important, since various fixes that worked for fork don't seem to work for spawn).
The import statement is something like:
from db import session
and then I use this session ad libitum without worrying about concurrency, assuming SQLite's internal locking mechanism will order transactions so as to avoid concurrency errors; I don't really care about transaction order.
This seems to result in errors like the following:
sqlite3.ProgrammingError: SQLite objects created in a thread can only be used in that same thread. The object was created in thread id 139813508335360 and this is thread id 139818279995200.
Mind you, this doesn't seem to directly affect my program (every transaction goes through just fine), but I am still worried about what's causing it.
My understanding was that scoped_session is thread-local, so I could import it however I want without issues. Furthermore, my assumption was that SQLAlchemy will always handle the closing of connections and that SQLite will handle ordering (i.e. make a session wait for another session to end before it can do any transaction).
Obviously one of these assumptions is wrong, or I am misunderstanding something basic about the mechanism here, but I can't quite figure out what. Any suggestions would be useful.
The problem isn't about thread-local sessions; it's that the original connection object lives in a different thread from those sessions. SQLite disallows using a connection across different threads by default.
The simplest answer to your question is to turn off SQLite's same-thread checking. In SQLAlchemy you can achieve this by specifying it as part of your database URL:
engine = create_engine('sqlite:///my_db.sqlite?check_same_thread=False')
I'm guessing that will do away with the errors, at least.
Depending on what you're doing, this may still be dangerous - if you're ensuring your transactions are serialised (that is, one after the other, never overlapping or simultaneous) then you're probably fine. If you can't guarantee that then you're risking data corruption, in which case you should consider a) using a database backend that can handle concurrent writes, or b) creating an intermediary app or service that solely manages sqlite reads and writes and that your other apps can communicate with. That latter option sounds fun but be warned you may end up reinventing the wheel when you're better off just spinning up a Postgres container or something.
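Incidentally, the same flag can also be passed through connect_args instead of the URL query string, with the same effect:

from sqlalchemy import create_engine

# Equivalent to appending ?check_same_thread=False to the URL; the flag is
# handed straight to sqlite3.connect().
engine = create_engine('sqlite:///my_db.sqlite',
                       connect_args={'check_same_thread': False})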
I've been trying to test this out, but haven't been able to come to a definitive answer. I'm using SQLAlchemy on top of MySQL and trying to prevent having threads that do a select, get a SHARED_READ lock on some table, and then hold on to it (preventing future DDL operations until it's released). This happens when queries aren't committed. I'm using SQLAlchemy Core, where as far as I could tell .execute() essentially works in autocommit mode, issuing a COMMIT after everything it runs unless explicitly told we're in a transaction. Nevertheless, in show processlist, I'm seeing sleeping threads that still have SHARED_READ locks on a table they once queried. What gives?
From your post I assume you're operating in "non-transactional" mode, either using an SQLAlchemy Connection without an ongoing transaction or the shorthand engine.execute(). In this mode of operation SQLAlchemy detects INSERT, UPDATE, DELETE, and DDL statements and automatically issues a COMMIT after them, but not after everything else, such as SELECT statements. See "Understanding Autocommit". For SELECTs of mutating stored procedures and the like that do require a commit, use
from sqlalchemy import text
conn.execute(text('SELECT ...').execution_options(autocommit=True))
You should also consider closing connections when the thread is done with them for the time being. Closing calls rollback() on the underlying DBAPI connection, which per PEP 249 is (probably) always in a transactional state. This removes the transactional state and/or locks and returns the connection to the connection pool. That way you shouldn't need to worry about SELECTs not autocommitting.
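A sketch of that "close when done" pattern (the URL and table name are made up); conn.close() rolls back the DBAPI connection and returns it to the pool, which releases any metadata lock the SELECT left behind:

from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://user:pw@localhost/mydb")  # hypothetical URL

conn = engine.connect()
try:
    rows = conn.execute(text("SELECT * FROM some_table")).fetchall()
finally:
    conn.close()   # rollback + return to pool, so no lingering SHARED_READ lock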
It is possible to do async I/O with psycopg2 (which can be read here); however, I'm not sure how to do async transactions. Consider this sequence of events:
Green Thread 1 starts transaction T
GT1 issues update
GT2 issues one transactional update
GT1 issues update
GT1 commits transaction T
I assume that GT1's updates would conflict with GT2's update.
Now according to the docs:
Cursors created from the same connection are not isolated, i.e., any changes done to the database by a cursor are immediately visible by the other cursors.
so we can't implement the flow above on cursors. We could implement it on different connections, but since we are doing async, spawning (potentially) thousands of db connections might be bad (not to mention that Postgres can't handle that many out of the box).
The other option is to have a pool of connections and reuse them. But then if we issue X parallel transactions, all other green threads are blocked until some connection is available. Thus the actual number of useful green threads is ~X (assuming the app is heavily db-bound), which raises the question: why would we use async to begin with?
Now this question can actually be generalized to DB API 2.0. Maybe the real answer is that DB API 2.0 is not suited for async programming? How would we do async io on Postgresql then? Maybe some other library?
Or maybe it is because the PostgreSQL protocol is actually synchronous? It would be perfect to be able to "write" to any transaction at any time (per connection). PostgreSQL would have to expose the transaction's id for that. Is it doable? Maybe two-phase commit is the answer?
Or am I missing something here?
EDIT: This seems to be a general problem with SQL, since BEGIN; COMMIT; semantics just can't be used efficiently in an asynchronous setting.
Actually you can use BEGIN; and COMMIT; with async. What you need is to set up a connection pool and make sure each green thread gets its own connection (just like a real thread would in a multithreaded application).
You cannot use psycopg2's builtin transaction handling.
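A rough sketch of that arrangement, assuming gevent for the green threads and psycogreen to make psycopg2 cooperative (the DSN, table, and column names are made up):

import gevent
from psycogreen.gevent import patch_psycopg
from psycopg2.pool import ThreadedConnectionPool

patch_psycopg()  # install the gevent wait callback so queries don't block the hub

pool = ThreadedConnectionPool(minconn=1, maxconn=20,
                              dsn="dbname=mydb user=me")  # hypothetical DSN

def do_transaction(row_id, new_value):
    conn = pool.getconn()            # one connection per green thread
    conn.autocommit = True           # skip psycopg2's implicit BEGIN
    try:
        cur = conn.cursor()
        cur.execute("BEGIN")         # explicit transaction control instead
        cur.execute("UPDATE items SET value = %s WHERE id = %s",
                    (new_value, row_id))
        cur.execute("COMMIT")
    finally:
        pool.putconn(conn)

gevent.joinall([gevent.spawn(do_transaction, i, i * 10) for i in range(10)])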
I sometimes run python scripts that access the same database concurrently. This often causes database lock errors. I would like the script to then retry ASAP as the database is never locked for long.
Is there a better way to do this than with a try/except inside a while loop, and does that method have any problems?
Increase the timeout parameter in your calls to the connect function:
db = sqlite3.connect("myfile.db", timeout=20)
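If you still want the explicit retry you describe as a fallback, a bounded loop along these lines (names and limits are illustrative) avoids spinning forever:

import sqlite3
import time

def execute_with_retry(sql, params=(), retries=5, delay=0.1):
    # Retry only on "database is locked" errors, up to `retries` attempts.
    for attempt in range(retries):
        db = sqlite3.connect("myfile.db", timeout=20)
        try:
            with db:                      # commits on success, rolls back on error
                return db.execute(sql, params).fetchall()
        except sqlite3.OperationalError as exc:
            if "locked" not in str(exc) or attempt == retries - 1:
                raise
            time.sleep(delay)
        finally:
            db.close()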
If you are looking for concurrency, SQLite is not the answer. The engine doesn't perform well when concurrency is needed, especially when writing from different threads, even if the tables are not the same.
If your scripts are accessing different tables, and they have no relationships at the DB level (i.e. declared FKs), you can separate them into different databases and your concurrency issue will be solved.
If they are linked, but you can handle the relationship at the app level (in the script), you can separate them as well.
The best practice in those cases is implementing a lock mechanism with events, but honestly I have no idea how to implement such a thing in Python.
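One possible shape for such a mechanism, sketched with an OS-level file lock rather than events (purely illustrative; POSIX only, and the lockfile path is made up):

import fcntl

def with_db_lock(work, lockfile="/tmp/my_db.lock"):
    # Each script wraps its database access in this so only one of them
    # touches SQLite at a time.
    with open(lockfile, "w") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX)   # blocks until the other script releases it
        try:
            return work()
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)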
I use MySQL with the MySQLdb module in Python, in Django.
I'm running in autocommit mode in this case (and Django's transaction.is_managed() actually returns False).
I have several processes interacting with the database.
One process fetches all Task models with Task.objects.all()
Then another process adds a Task model (I can see it in a database management application).
If I call Task.objects.all() on the first process, I don't see anything. But if I call connection._commit() and then Task.objects.all(), I see the new Task.
My question is: is there any caching involved at the connection level? And is it normal behaviour (it does not seem so to me)?
This certainly seems autocommit/table-locking related.
If mysqldb implements the dbapi2 spec, it will probably have a connection running as one single continuous transaction. When you say 'running in autocommit mode', do you mean MySQL itself, the mysqldb module, or Django?
Not intermittently committing perfectly explains the behaviour you are getting:
i) a connection implemented as one single transaction in mysqldb (by default, probably)
ii) not opening/closing connections only when needed but (re)using one (or more) persistent database connections (my guess, could be Django-architecture-inherited)
iii) your selects ('reads') take a simple read lock on a table, which means other connections can still read that table, but connections wanting to write data can't (immediately), because the read lock prevents them from getting the exclusive lock needed for writing. The write is thus postponed until it can get a (short) exclusive lock on the table, i.e. when you close the connection or manually commit.
I'd do the following in your case:
find out which table locks are on your database during the scenario above (see the sketch after this list)
read about Django and transactions here. A quick skim suggests that using standard Django functionality implicitly causes commits. This means handcrafted SQL (insert, update...) may not.
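For the first point, a quick way to inspect locks from within the process, using raw MySQL-specific statements through Django's connection (purely illustrative):

from django.db import connection

cursor = connection.cursor()
# Which tables are currently locked or in use...
cursor.execute("SHOW OPEN TABLES WHERE In_use > 0")
print(cursor.fetchall())
# ...and what each connection is doing right now.
cursor.execute("SHOW FULL PROCESSLIST")
print(cursor.fetchall())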