I'm not sure I understand the use case for DB connection pools (e.g. psycopg2.pool and mysql.connector.pooling) in Python. It seems to me that parallelism in Python is usually achieved with a multi-process rather than a multi-thread approach because of the GIL, and that in the multi-process case these pools are not very useful, since each process will initialize its own pool and will only have a single thread running at a time. Is this correct? Is there any strategy for sharing a DB connection pool across multiple processes? If not, is the usefulness of pooling limited to multi-threaded Python applications, or are there other scenarios where you would use them?
Keith,
You're on the right track. As mentioned in the S.O. post "Accessing a MySQL connection pool from Python multiprocessing":
Making a separate pool for each process is redundant and opens up way too many connections.
Check out the other S.O. post, "What is the best solution for database connection pooling in python?"; it contains a sample pooling solution in Python. That post also discusses the limitations of DB pooling if your application were to become multi-threaded:
Making your own connection pool is a BAD idea if your app ever decides to start using multi-threading. Making a connection pool for a multi-threaded application is much more complicated than one for a single-threaded application. You can use something like PySQLPool in that case.
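To make the single-process, multi-threaded case concrete, here is a minimal sketch using psycopg2's built-in pool (which the question mentions); the DSN, table, and pool sizes are placeholders:

    from psycopg2 import pool

    # One pool per process; threads borrow and return connections,
    # so the total number of connections is capped at maxconn.
    db_pool = pool.ThreadedConnectionPool(
        minconn=1, maxconn=5,
        dsn="dbname=app user=app host=localhost",  # placeholder DSN
    )

    def run_query(sql):
        conn = db_pool.getconn()          # borrow a connection
        try:
            with conn.cursor() as cur:
                cur.execute(sql)
                return cur.fetchall()
        finally:
            db_pool.putconn(conn)         # return it to the pool

    # At shutdown: db_pool.closeall()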
In terms of implementing DB pooling in Python, as mentioned in "Application vs Database Resident Connection Pool," if your database supports it, the best implementation would involve:
Let the connection pool be maintained and managed by the database itself (example: Oracle's DRCP), and have calling modules just ask for connections from the connection broker described by Oracle DRCP.
Please let me know if you have any questions!
We are building a Python microservices application with PostgreSQL as the service datastore. At first glance Nameko seems a good starting point. However, the Nameko documentation section on Concurrency includes this statement:
Nameko is built on top of the eventlet library, which provides concurrency via “greenthreads”. The concurrency model is co-routines with implicit yielding.
Implicit yielding relies on monkey patching the standard library, to trigger a yield when a thread waits on I/O. If your host services with nameko run on the command line, Nameko will apply the monkey patch for you.
Each worker executes in its own greenthread. The maximum number of concurrent workers can be tweaked based on the amount of time each worker will spend waiting on I/O.
Workers are stateless so are inherently thread safe, but dependencies should ensure they are unique per worker or otherwise safe to be accessed concurrently by multiple workers.
Note that many C-extensions that are using sockets and that would normally be considered thread-safe may not work with greenthreads. Among them are librabbitmq, MySQLdb and others.
Our architect is suggesting that Nameko is therefore not going to fly, because although the psycopg2 PostgreSQL driver is advertised as thread safe:
Its main features are the complete implementation of the Python DB API 2.0 specification and the thread safety (several threads can share the same connection). It was designed for heavily multi-threaded applications
It is clarified:
The above observations are only valid for regular threads: they don't apply to forked processes nor to green threads. libpq connections shouldn't be used by forked processes, so when using a module such as multiprocessing or a forking web deploy method such as FastCGI make sure to create the connections after the fork.
Connections shouldn't be shared either by different green threads: see Support for coroutine libraries for further details.
With a link clarifying:
Warning: Psycopg connections are not green thread safe and can't be used concurrently by different green threads. Trying to execute more than one command at a time using one cursor per thread will result in an error (or a deadlock on versions before 2.4.2).
Therefore, programmers are advised to either avoid sharing connections between coroutines or to use a library-friendly lock to synchronize shared connections, e.g. for pooling.
Our normal service configuration would have the service hold a repository with a connection shared across threads, with the repository's access methods using sessions on that connection, scoped to each method.
Our architect suggests that even if we were to go with a connection+session per thread, the way greenthreads yield implicitly could still bite us: if we perform other I/O between data access calls on a session (e.g. a file write via logging), we might suffer an implicit context switch, which could then cause issues on the session after the logging call.
Is there any reasonable way we can use Nameko in this context or is it doomed as our architect suggests?
Is there any way we can make this work without having to write our own microservice code, e.g. by using Kombu?
Additional note: a comment on this page regarding database drivers states:
You may use any database driver compatible with SQLAlchemy provided it is safe to use with eventlet. This will include all pure-python drivers.
It goes on to list pysqlite & pymysql.
Would using either the pg8000 or py-postgresql pure-Python drivers put us in the clear threading-wise? Is the issue here greenthreads in combination with a psycopg2/3 driver that uses C code, or is it fundamentally Nameko's use of greenthreads?
My Python application uses concurrent.futures.ProcessPoolExecutor with 5 workers, and each process makes multiple database queries.
Between giving each process its own DB client and making all processes share a single client, which is considered safer and more conventional?
Short answer: Give each process (that needs it) its own db client.
Long answer: What problem are you trying to solve?
Sharing a DB client between processes basically doesn't happen; you'd have to have the one process which does have the DB client proxy the queries from the others, using more-or-less your own protocol. That can have benefits, if that protocol is specific to your application, but it will add complexity: you'll now have two different kinds of workers in your program, rather than just one kind, plus the protocol between them. You'd want to make sure that the benefits outweigh the additional complexity.
Sharing a DB client between threads is usually possible; you'd have to check the documentation to see which objects and operations are "thread-safe". However, since your application is otherwise CPU-heavy, threading is not suitable, due to Python limitations (the GIL).
At the same time, there's little cost to having a DB client in each process: you will in any case need some sort of client, and it might as well be the direct one (see the sketch below).
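As a minimal sketch of that per-process approach, each worker process can open its own connection once in an initializer; psycopg2 is just a placeholder driver choice here, and the DSN, table, and column names are invented for illustration:

    from concurrent.futures import ProcessPoolExecutor

    import psycopg2  # placeholder driver choice

    _conn = None  # module-level; each worker process gets its own copy

    def _init_worker(dsn):
        # Runs once in each worker process, after it starts.
        global _conn
        _conn = psycopg2.connect(dsn)

    def run_query(user_id):
        # Every task in this process reuses the process-local connection.
        with _conn.cursor() as cur:
            cur.execute("SELECT name FROM users WHERE id = %s", (user_id,))
            return cur.fetchone()

    if __name__ == "__main__":
        dsn = "dbname=app user=app host=localhost"  # placeholder
        with ProcessPoolExecutor(max_workers=5,
                                 initializer=_init_worker,
                                 initargs=(dsn,)) as pool:
            print(list(pool.map(run_query, range(5))))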
There isn't going to be much more IO, since that's mostly based on the total number of queries and amount of data, regardless of whether that comes from one process or gets spread among several. The only additional IO will be in the login, and that's not much.
If you're running out of connections at the database, you can either tune/upgrade your database for more connections, or use a separate off-the-shelf "connection pooler" to share them; that's likely to be much better than trying to implement a connection pooler from scratch.
More generally, and this applies well beyond this particular question, it's often better to combine several off-the-shelf pieces in a straightforward way, than it is to try to put together a custom complex piece that does the whole thing all at once.
So, what problem are you trying to solve?
It is better to use a multithreading or asynchronous approach instead of multiprocessing, because it will consume fewer resources. That way you could use a single DB connection, but I would recommend creating a separate session for each worker or coroutine to avoid exceptions and locking problems.
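One way to get a separate session per worker, as suggested above, is SQLAlchemy's scoped_session; this is a minimal sketch under that assumption (SQLAlchemy isn't named in the answer, and the URL and query are placeholders):

    from concurrent.futures import ThreadPoolExecutor

    from sqlalchemy import create_engine, text
    from sqlalchemy.orm import scoped_session, sessionmaker

    engine = create_engine("postgresql://app:secret@localhost/app")  # placeholder
    Session = scoped_session(sessionmaker(bind=engine))  # one session per thread

    def worker(user_id):
        session = Session()  # thread-local session; all share the engine's pool
        try:
            return session.execute(
                text("SELECT name FROM users WHERE id = :id"), {"id": user_id}
            ).scalar()
        finally:
            Session.remove()  # discard this thread's session

    with ThreadPoolExecutor(max_workers=5) as pool:
        print(list(pool.map(worker, range(5))))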
I want to create and use a connection pool for an app I'm making, and I'm trying to figure out whether I need to close the connection pool itself when exiting the app. I know how and when to close the connections I take from the pool; what I can't find an answer to is whether I need to close the pool itself, or how to do that for that matter. I mean the actual connections to SQL held by the pool, not the connections borrowed from it.
Sure thing, no worries! Just grab SQLAlchemy's Connection Pools and go to town.
Really, though, unless you're doing this for exercise and are willing to think through all of the corner cases (and as you can tell from the length of that manual page, they're not few), don't implement connection pooling yourself.
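On the original question of closing the pool itself: with SQLAlchemy, Engine.dispose() closes the pooled connections at shutdown. A minimal sketch, with a placeholder URL and pool sizes:

    from sqlalchemy import create_engine, text

    engine = create_engine(
        "postgresql://app:secret@localhost/app",  # placeholder URL
        pool_size=5,        # connections kept open in the pool
        max_overflow=10,    # extra connections allowed under load
    )

    with engine.connect() as conn:  # borrows from the pool, returns on exit
        print(conn.execute(text("SELECT 1")).scalar())

    engine.dispose()  # on app exit: closes all pooled connections to SQL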
I'm programming a bit of server code, and the MQTT side of it runs in its own thread using the threading module, which works great with no issues, but now I'm wondering how to proceed.
I have two MariaDB databases; one is local and the other is remote (there is a good, niche reason for this), and I'm writing a class that handles the databases. This class will start new threads of classes that submit the data to their respective databases. If certain conditions are true, it starts a new thread to push the data to one database; if they are false, the data goes to the other database. The MQTT thread has an instance of the "database handler" class and passes data to it through different calling functions within the class.
Will this approach let one thread concentrate on MQTT tasks while another does the database work? There are other threads as well; I've just never combined databases and threads before, so I'd like an opinion, or any information that would help me out, from more seasoned programmers.
Writing code that is "thread safe" can be tricky. I doubt that the Python connector for MySQL is thread safe; there is very little need for it to be.
MySQL is quite happy to have multiple connections to it from clients. But they must be separate connections, not the same connection running in separate threads.
Very few projects need multi-threaded access to the database. Do you have a particular need? If so let's hear about it, and discuss the 'right' way to do it.
For now, each of your threads that needs to talk to the database should create its own connection. Generally, such a connection can be created soon after starting the thread (or process) and kept open until close to the end of the thread. That is, normally you should have only one connection per thread.
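A minimal sketch of that one-connection-per-thread pattern, using thread-local storage; the mariadb driver, credentials, and table are placeholder choices (mysql.connector would work the same way):

    import threading

    import mariadb  # placeholder driver choice

    _local = threading.local()

    def get_conn():
        # The first call in each thread opens that thread's own connection;
        # later calls in the same thread reuse it.
        if not hasattr(_local, "conn"):
            _local.conn = mariadb.connect(
                host="localhost", user="app",
                password="secret", database="app",  # placeholders
            )
        return _local.conn

    def insert_reading(sensor_id, value):
        conn = get_conn()
        cur = conn.cursor()
        cur.execute(
            "INSERT INTO readings (sensor_id, value) VALUES (?, ?)",
            (sensor_id, value),
        )
        conn.commit()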
In C, when making a networking client/server setup, I usually have to do some standard BSD socket setup. Then on the server side I'll have to manage multiple threads, usually a main thread and an I/O thread. Each connection is managed by a connection manager, so that existing connections can be processed while new requests are coming in.
What are some good ways to do connection management in C? Are there well-known libraries that handle all of this? I know about Boost for C++, but I'm interested in C and Python.
Thanks,
Chenz
P.S. Sorry about the not-so-well-thought-out question. I'll try to polish it up soon.
Personally, I am not a huge fan of the one-thread-per-connection model with synchronous IO. I prefer X threads with a pool of Y connections with asynchronous IO. You can spawn threads as needed, or round robin the connections as they come in to a pre-allocated pool.
If you want to be really tricky, spawn threads with lifetime management, where new connections go to the newest spawned thread so the old thread can be killed off. That way, if a thread holds on to a resource, the resource will be released when the thread is cleaned up.
You may want to look at select, poll, epoll, completion ports, and AIO.
Most of these are wrapped up in libevent.
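On the Python side, the standard library's selectors module wraps select/poll/epoll/kqueue behind one interface. Here's a minimal event-driven echo-server sketch, servicing many connections from a single thread; the address and port are placeholders:

    import selectors
    import socket

    sel = selectors.DefaultSelector()

    def accept(server_sock):
        conn, _addr = server_sock.accept()
        conn.setblocking(False)
        sel.register(conn, selectors.EVENT_READ, handle)

    def handle(conn):
        data = conn.recv(4096)
        if data:
            conn.sendall(data)        # echo the bytes back
        else:
            sel.unregister(conn)      # client closed the connection
            conn.close()

    server = socket.socket()
    server.bind(("0.0.0.0", 12345))   # placeholder address/port
    server.listen()
    server.setblocking(False)
    sel.register(server, selectors.EVENT_READ, accept)

    while True:
        for key, _events in sel.select():
            key.data(key.fileobj)     # dispatch to accept() or handle()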