It is possible to do async I/O with psycopg2 (as described here), but I'm not sure how to do async transactions. Consider this sequence of events:
Green Thread 1 starts transaction T
GT1 issues update
GT2 issues one transactional update
GT1 issues update
GT1 commits transaction T
I assume that GT1's updates conflict with GT2's update.
Now according to docs:
Cursors created from the same connection are not isolated, i.e., any
changes done to the database by a cursor are immediately visible by
the other cursors.
so we can't implement the flow above on cursors. We could implement it on different connections, but since we are doing async, spawning (potentially) thousands of DB connections might be bad (not to mention that Postgres can't handle that many out of the box).
The other option is to have a pool of connections and reuse them. But then if we issue X parallel transactions, all other green threads are blocked until some connection is available. Thus the actual number of useful green threads is ~X (assuming the app is heavily DB-bound), which raises the question: why would we use async to begin with?
Now this question can actually be generalized to DB API 2.0. Maybe the real answer is that DB API 2.0 is not suited for async programming? How would we do async I/O on PostgreSQL then? Maybe with some other library?
Or maybe it is because the PostgreSQL protocol is actually synchronous? It would be perfect to be able to "write" to any transaction at any time (per connection). PostgreSQL would have to expose a transaction's ID for that. Is it doable? Maybe two-phase commit is the answer?
Or am I missing something here?
EDIT: This seems to be a general problem with SQL, since BEGIN; COMMIT; semantics just can't be used efficiently in an asynchronous setting.
Actually, you can use BEGIN; and COMMIT; with async. What you need is to set up a connection pool and make sure each green thread gets its own connection (just like a real thread would in a multithreaded application).
You cannot use psycopg2's builtin transaction handling.
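For illustration, a minimal sketch of that idea, assuming gevent (with psycogreen's wait callback so psycopg2 yields to the hub) and psycopg2's ThreadedConnectionPool; the DSN and the run_transaction helper are placeholders, not anything from the question:

from gevent import monkey
monkey.patch_all()
from psycogreen.gevent import patch_psycopg
import psycopg2.pool

patch_psycopg()  # make psycopg2 cooperate with gevent's event loop

# Size the pool to the number of transactions you want in flight at once.
conn_pool = psycopg2.pool.ThreadedConnectionPool(1, 20, "dbname=test user=test")

def run_transaction(statements):
    conn = conn_pool.getconn()       # this green thread owns the connection for now
    try:
        conn.autocommit = True       # bypass psycopg2's builtin transaction handling
        with conn.cursor() as cur:
            cur.execute("BEGIN")     # explicit transaction control instead
            try:
                for stmt, params in statements:
                    cur.execute(stmt, params)
                cur.execute("COMMIT")
            except Exception:
                cur.execute("ROLLBACK")
                raise
    finally:
        conn_pool.putconn(conn)      # hand the connection back for other green threads

Note that psycopg2's pool raises an error rather than blocking when it is exhausted, so a real application would guard getconn() with a semaphore or use a pool that blocks.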
I've been trying to test this out, but haven't been able to come to a definitive answer. I'm using SQLAlchemy on top of MySQL and trying to prevent having threads that do a select, get a SHARED_READ lock on some table, and then hold on to it (preventing future DDL operations until it's released). This happens when queries aren't committed. I'm using SQLAlchemy Core, where as far as I could tell .execute() essentially works in autocommit mode, issuing a COMMIT after everything it runs unless explicitly told we're in a transaction. Nevertheless, in show processlist, I'm seeing sleeping threads that still have SHARED_READ locks on a table they once queried. What gives?
Assuming from your post that you're operating in "non-transactional" mode, either using an SQLAlchemy Connection without an ongoing transaction or the shorthand engine.execute(). In this mode of operation SQLAlchemy will detect INSERT, UPDATE, DELETE, and DDL statements and automatically issue a commit afterwards, but not for everything, such as SELECT statements. See "Understanding Autocommit". For SELECTs that call mutating stored procedures and the like, which do require a commit, use
conn.execute(text('SELECT ...').execution_options(autocommit=True))
You should also consider closing connections when the thread is done with them for the time being. Closing will call rollback() on the underlying DBAPI connection, which per PEP 249 is (probably) always in a transactional state. This clears the transactional state and/or locks and returns the connection to the connection pool. That way you shouldn't need to worry about SELECTs not autocommitting.
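A minimal sketch of both points, assuming an older SQLAlchemy 1.x where the autocommit execution option still exists (the URL and procedure name are placeholders):

from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://user:pass@localhost/mydb")  # placeholder URL

conn = engine.connect()
try:
    # SELECTs are not autocommitted; opt in explicitly for a mutating procedure.
    conn.execute(
        text("SELECT my_mutating_procedure()").execution_options(autocommit=True)
    )
    rows = conn.execute(text("SELECT * FROM some_table")).fetchall()
finally:
    # close() rolls back the underlying DBAPI connection, releasing any
    # metadata locks, and returns it to the pool.
    conn.close()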
If I don't need transactions, can I reuse the same database connection for multiple requests?
Flask documentation says:
Because database connections encapsulate a transaction, we also need to make sure that only one request at the time uses the connection.
Here's how I understand the meaning of the above sentence:
A Python DB-API connection can only handle one transaction at a time; to start a new transaction, one must first commit or roll back the previous one. So if each of our requests needs its own transaction, then of course each request needs its own database connection.
Please let me know if I got it wrong.
But let's say I set autocommit mode, and handle each request in a single SQL statement. Or, alternatively, let's say I only read - not write - to the database. In either case, it seems I can just reuse the same database connection for all my requests to save the overhead of multiple connections. But I'm not sure if there's any downside to this approach.
Edit: I can see one issue with what I'm proposing: each request might be handled by a different process. Since connections should probably not be reused across processes, let me clarify my question: I mean creating one connection per process, and using it for all requests that happen to be handled by this process.
On the other hand, the whole point of (green or native) threads is usually to serve one request per thread, so my proposed approach implies sharing a connection across threads. It seems one connection can be used concurrently in multiple native threads, but not in multiple green threads.
So let's say for concreteness my environment is flask + gunicorn with multiple multi-threaded sync workers.
Based on @Craig Ringer's comment on a different question, I think I know the answer.
The only possible advantage of connection sharing is performance (other factors - like transaction encapsulation and simplicity - favor a separate connection per request). And since a connection can't be shared across processes or green threads, it only has a chance with native threads. But psycopg2 (and presumably other drivers) doesn't allow concurrent access from the same connection. So unless each request spends very little time talking to the database, there is likely a performance hit, not benefit, from connection sharing.
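For what it's worth, the conventional alternative (a connection per request, opened lazily and closed on teardown) is cheap to write in Flask; a minimal sketch assuming psycopg2, with the DSN as a placeholder:

import psycopg2
from flask import Flask, g

app = Flask(__name__)
DSN = "dbname=test user=test"  # placeholder

def get_db():
    # Lazily open one connection per request, stored on the app context.
    if "db" not in g:
        g.db = psycopg2.connect(DSN)
    return g.db

@app.teardown_appcontext
def close_db(exc):
    # Runs at the end of every request, whether or not an error occurred.
    db = g.pop("db", None)
    if db is not None:
        db.close()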
In production environments where Django is running on Apache or with multiple Gunicorn workers, it runs the risk of concurrency issues.
As such, I was pretty surprised to find that Django's ORM doesn't explicitly support table/row locking. It supports transactions very handily, but that only solves half of the concurrency problem.
With a MySQL backend, what is the correct way to perform locking in Django? Or is there something else at play in Django's framework that makes them unnecessary?
Django does not explicitly provide an API to perform table locking. In my experience, well-designed code rarely needs to lock a whole table, and most concurrency issues can be solved with row-level locking (a sketch of which follows the table-locking example below). Table locking is a last-ditch effort: it doesn't solve concurrency, it simply kills any attempt at concurrency.
If you really need table-level locking, you can use a cursor and execute raw SQL statements:
from django.db import connection

with connection.cursor() as cursor:
    # Table names cannot be passed as query parameters, so tablename must
    # come from trusted code, never from user input.
    cursor.execute("LOCK TABLES %s READ" % tablename)
    try:
        ...
    finally:
        cursor.execute("UNLOCK TABLES;")
Consider setting the transaction isolation level to serializable and using a MySQL table type that supports transactions (MyISAM does not, InnoDB does).
After ensuring you have a backend that supports transactions, you'd then need to disable autocommit (https://docs.djangoproject.com/en/1.8/topics/db/transactions/#autocommit-details) and then ensure that your code issues the appropriate commit or rollback statement at the end of what you consider to be transactions.
There are one or two examples in the docs referenced above.
Doing this requires a bit more work and consideration, but provides you with transactions.
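A minimal sketch of that approach with Django's transaction API:

from django.db import transaction

transaction.set_autocommit(False)   # take manual control of the transaction
try:
    # ... your ORM calls / raw SQL that must succeed or fail together ...
    transaction.commit()
except Exception:
    transaction.rollback()
    raise
finally:
    transaction.set_autocommit(True)

In most cases wrapping the block in transaction.atomic() achieves the same thing with less ceremony.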
I am working on an online judge. I am using Python 2.7 and MySQL (as I am working on the back-end part).
My Method:
I create a main thread which pulls submissions from the database (10 at a time) and puts them in a queue. Then I have multiple threads that take submissions from the queue, evaluate them, and write the results back to the database.
Now I have some doubts (I know they are doubts about different topics, but suggestions on any of them are highly appreciated).
Currently, when I start the threads, I give each of them its own DB connection, which it uses. Is giving one connection per thread good practice? Does sharing connections between threads create problems? How do I go about this?
My main thread uses a single connection, as its only work is to pull submissions from the DB and put them in the queue (and also update their status in the DB to Assessing Submission). But sometimes I get the error: Lost connection to MySQL server while querying. I keep getting it even when I stop the program and start it again. What do I do about it? Also, should I implement a pool of connections just for the main thread?
Also, does a DB connection stay alive forever? What should I do when its session memory etc. gets exhausted; how do I handle that?
Use a connection pool. Sharing a database connection is not always bad, but you have to be careful about it. You can try SQLAlchemy to manage a lot of this for you: http://docs.sqlalchemy.org/en/rel_0_8/orm/session.html#unitofwork-contextual
The server might be out of connections, or your connection might have been killed because it uses too many resources, etc. A connection pool could help you solve this.
It all depends; theoretically it could stay alive indefinitely, but usually there is a timeout somewhere.
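As an illustration of the pooling advice above, a minimal sketch of one SQLAlchemy engine per process whose pool hands connections to the worker threads (the URL, pool sizes, and the table and status names are guesses based on the question):

from sqlalchemy import create_engine, text

engine = create_engine(
    "mysql+pymysql://user:pass@localhost/judge",  # placeholder URL
    pool_size=10,         # roughly the number of worker threads
    pool_recycle=3600,    # refresh connections before the server's idle timeout
)

def mark_assessing(submission_id):
    # A connection is checked out here and returned to the pool on exit;
    # engine.begin() also commits the transaction for us.
    with engine.begin() as conn:
        conn.execute(
            text("UPDATE submissions SET status = 'Assessing Submission' WHERE id = :id"),
            {"id": submission_id},
        )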
If you give the same connection to every thread, the threads will not be able to query the database safely and race conditions will occur. So you need to provide a separate connection to every thread, and that is indeed a good idea. Use a connection pool for this purpose; it will help you get a separate connection for each thread.
A connection pool will surely help.
Release the connection once your work with it is over. There is a limit on how long a connection can be held, termed the connection timeout. You may want to use a third-party library to handle that; c3p0 is a good library that can help you with this.
Please refer to the link below to configure it:
Best configuration of c3p0
I recall hearing that the connection process in MySQL was designed to be very fast compared to other RDBMSes, and that therefore using a library that provides connection pooling (SQLAlchemy) won't actually help you that much if you enable the connection pool.
Does anyone have any experience with this?
I'm leery of enabling it because of the possibility that, if some code does something stateful to a DB connection and (perhaps mistakenly) doesn't clean up after itself, state which would normally get cleaned up upon closing the connection will instead get propagated to subsequent code that gets the recycled connection.
There's no need to worry about residual state on a connection when using SQLA's connection pool, unless your application is changing connection-wide options like transaction isolation levels (which generally is not the case). SQLA's connection pool issues a connection.rollback() on the connection when it's checked back in, so that any transactional state or locks are cleared.
It is possible that MySQL's connection time is pretty fast, especially if you're connecting over Unix sockets on the same machine. If you do use a connection pool, you also want to ensure that connections are recycled after some period of time, as the MySQL server will automatically close connections that have been idle for more than 8 hours by default (in SQLAlchemy this is the pool_recycle option).
You can quickly do some benching of connection pool vs. non with a SQLA application by changing the pool implementation from the default of QueuePool to NullPool, which is a pool implementation that doesn't actually pool anything - it connects and disconnects for real when the proxied connection is acquired and later closed.
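A minimal sketch of that comparison (URL and iteration count are placeholders):

import time
from sqlalchemy import create_engine, text
from sqlalchemy.pool import NullPool

URL = "mysql+pymysql://user:pass@localhost/test"  # placeholder

def bench(engine, n=500):
    start = time.time()
    for _ in range(n):
        # With QueuePool this checkout reuses a connection; with NullPool
        # it connects and disconnects for real each time.
        with engine.connect() as conn:
            conn.execute(text("SELECT 1"))
    return time.time() - start

pooled = create_engine(URL)                        # default QueuePool
unpooled = create_engine(URL, poolclass=NullPool)  # no pooling at all

print("QueuePool:", bench(pooled))
print("NullPool: ", bench(unpooled))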
Even if the connection part of MySQL itself is pretty slick, presumably there's still a network connection involved (whether that's loopback or physical). If you're making a lot of requests, that could get significantly expensive. It will depend (as is so often the case) on exactly what your application does, of course - if you're doing a lot of work per connection, then that will dominate and you won't gain a lot.
When in doubt, benchmark - but I would by-and-large trust that a connection pooling library (at least, a reputable one) should work properly and reset things appropriately.
Short answer: you need to benchmark it.
Long answer: it depends. MySQL is fast for connection setup, so avoiding that cost is not a good reason to go for connection pooling. Where you win there is if the queries run are few and fast because then you will see a win with pooling.
The other worry is how the application treats the SQL connection (which MySQL calls a thread). If it does no SQL transactions and makes no assumptions about the state of the connection, then pooling won't be a problem. On the other hand, code that relies on the connection being closed to discard temporary tables or to roll back transactions will have a lot of problems with pooling.
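If that kind of connection state is a concern, SQLAlchemy lets you choose what happens when a connection goes back into the pool; a short sketch (URL is a placeholder):

from sqlalchemy import create_engine

# 'rollback' (the default) clears open transactions and locks on check-in;
# temporary tables and session variables still survive until the connection
# is really closed or recycled.
engine = create_engine(
    "mysql+pymysql://user:pass@localhost/test",  # placeholder
    pool_reset_on_return="rollback",
    pool_recycle=3600,
)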
The connection pool speeds things up due to the fact that you do not have to create a java.sql.Connection object every time you do a database query. I use the Tomcat connection pool with a MySQL database for web applications that do a lot of queries; during high user load there is a noticeable speed improvement.
I made a simple RESTful service with Django and tested it with and without connection pooling. In my case, the difference was quite noticeable.
In a LAN, without it, response time was between 1 and 5 seconds. With it, less than 20 ms.
Results may vary, but the configuration I'm using for the MySQL & Apache servers is pretty standard low-end.
If you're serving UI pages over the internet the extra time may not be noticeable to the user, but in my case it was unacceptable, so I opted for using the pool. Hope this helps you.