Applications often need to connect to other services (a database, a cache, an API, etc.). For sanity and DRY, we'd like to keep all of these connections in one module so the rest of our code base can share them.
To reduce boilerplate, downstream usage should be simple:
# app/do_stuff.py
from .connections import AwesomeDB

db = AwesomeDB()

def get_stuff():
    return db.get('stuff')
And setting up the connection should also be simple:
# app/cli.py or some other main entry point
from .connections import AwesomeDB
db = AwesomeDB()
db.init(username='stuff admin') # Or os.environ['DB_USER']
Web frameworks like Django and Flask do something like this, but it feels a bit clunky:
Connect to a Database in Flask, Which Approach is better?
http://flask.pocoo.org/docs/0.10/tutorial/dbcon/
One big issue with this is that we want a reference to the actual connection object instead of a proxy, because we want to retain tab-completion in IPython and other dev environments.
So what's the Right Way (tm) to do it? After a few iterations, here's my idea:
# app/connections.py
from awesome_database import AwesomeDB as RealAwesomeDB
from horrible_database import HorribleDB as RealHorribleDB


class ConnectionMixin(object):
    __connection = None

    def __new__(cls):
        cls.__connection = cls.__connection or object.__new__(cls)
        return cls.__connection

    def __init__(self, real=False, **kwargs):
        if real:
            super().__init__(**kwargs)

    def init(self, **kwargs):
        kwargs['real'] = True
        self.__init__(**kwargs)


class AwesomeDB(ConnectionMixin, RealAwesomeDB):
    pass


class HorribleDB(ConnectionMixin, RealHorribleDB):
    pass
Room for improvement: Set initial __connection to a generic ConnectionProxy instead of None, which catches all attribute access and throws an exception.
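For illustration, a minimal sketch of such a proxy (ConnectionProxy and its error message are hypothetical, not part of the code above):

class ConnectionProxy(object):
    """Placeholder that fails loudly if the connection is used before init()."""
    def __getattr__(self, name):
        raise RuntimeError('Connection not initialized; call init() first.')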
I've done quite a bit of poking around here on SO and in various OSS projects and haven't seen anything like this. It feels pretty solid, though it does mean a bunch of modules will be instantiating connection objects as a side effect at import time. Will this blow up in my face? Are there any other negative consequences to this approach?
First, design-wise, I might be missing something, but I don't see why you need the heavy mixin+singleton machinery instead of just defining a helper like so:
_awesome_db = None

def awesome_db(**overrides):
    global _awesome_db
    if _awesome_db is None:
        # Read config/set defaults.
        # overrides.setdefault(...)
        _awesome_db = RealAwesomeDB(**overrides)
    return _awesome_db
Also, there is a bug. It might not look like a supported use case, but anyway: if you make the following two calls in a row, you would wrongly get the same connection object twice even though you passed different parameters:
db = AwesomeDB()
db.init(username='stuff admin')
db = AwesomeDB()
db.init(username='not-admin') # You'll get admin connection here.
An easy fix for that would be to use a dict of connections keyed on the input parameters.
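A minimal sketch of that idea, kept standalone rather than wired into the mixin above (get_connection is a hypothetical helper, and the kwargs are assumed to be hashable):

_connections = {}  # cache keyed on (class, connection parameters)

def get_connection(cls, **kwargs):
    key = (cls, tuple(sorted(kwargs.items())))
    if key not in _connections:
        _connections[key] = cls(**kwargs)  # e.g. RealAwesomeDB(username='stuff admin')
    return _connections[key]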
Now, on the essence of the question.
I think the answer depends on how your "connection" classes are actually implemented.
Potential downsides with your approach I see are:
In a multithreaded environment you could get problems with unsynchronized concurrent access to the global connection object from multiple threads, unless it is already thread-safe. If you care about that, you could change your code and interface a bit and use a thread-local variable (see the sketch after this list).
What if a process forks after creating the connection? Web application servers tend to do that and it might not be safe, again depending on the underlying connection.
Does the connection object have state? What happens if the connection object becomes invalid (e.g. due to a connection error or timeout)? You might need to replace the broken connection with a new one to return the next time a connection is requested.
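A minimal sketch of the thread-local variant, assuming the RealAwesomeDB class from the question (each thread lazily gets its own connection):

import threading

from awesome_database import AwesomeDB as RealAwesomeDB

_local = threading.local()

def awesome_db(**kwargs):
    # Each thread sees its own _local.connection attribute.
    if not hasattr(_local, 'connection'):
        _local.connection = RealAwesomeDB(**kwargs)
    return _local.connection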
Connection management is often already efficiently and safely implemented through a connection pool in client libraries.
For example, the redis-py Redis client uses the following implementation:
https://github.com/andymccurdy/redis-py/blob/1c2071762ad9b9288e786665990083e61c1cf355/redis/connection.py#L974
The Redis client then uses the connection pool like so:
Requests a connection from the connection pool.
Tries to execute a command on the connection.
If the connection fails, the client closes it.
In any case, it is finally returned to the connection pool so it can be reused by subsequent calls or other threads.
So since the Redis client handles all of that under the hood, you can safely do what you want directly. Connections will be lazily created until the connection pool reaches full capacity.
# app/connections.py
import redis

def redis_client(**kwargs):
    # Maybe read configuration/set default arguments
    # kwargs.setdefault(...)
    return redis.Redis(**kwargs)
Similarly, SQLAlchemy can use connection pooling as well.
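For example, a pooled SQLAlchemy engine can be shared as a module-level object in much the same way (the URL and pool sizes here are just placeholders):

# app/connections.py
from sqlalchemy import create_engine

# The engine maintains its own connection pool internally.
engine = create_engine(
    'postgresql://user:password@localhost/mydb',  # placeholder URL
    pool_size=5,        # connections kept open in the pool
    max_overflow=10,    # extra connections allowed under load
)

def get_engine():
    return engine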
To summarize, my understanding is that:
If your client library supports connection pooling, you don't need to do anything special to share connections between modules and even threads. You could just define a helper similar to redis_client() that reads configuration or specifies default parameters.
If your client library provides only low-level connection objects, you will need to make sure access to them is thread-safe and fork-safe. Also, you need to make sure that each time a connection is requested you return a valid one (or raise an exception if you can't establish or reuse an existing one).
Related
Supposing I have a replicated master-slave Redis setup, how do I actually access it from a client library? Presumably I need a client instance per host, and I need to decide which one I want to use for writing and reading, like:
master = redis.Redis(master_host, port)
clients = [redis.Redis(host, port) for host in slave_hosts]
master.set('key', 'value')
client = random.choice(clients)
client.get('other-key')
I was thinking there should be some magic in the library where I could provide a list of hosts for making such routing automatic, but couldn't find it.
I've also looked into redis-cluster and redis-sentinel and they all start by talking about automatic failovers when slaves become masters, and I'm not sure it's what I need. What I need is a consistent master which I can afford to lose for some time (I can hold up updates in a queue).
Are you intentionally splitting reads and writes, and doing so because you already know you will be overwhelming the Redis instance? If not, don't worry about splitting r/w between servers. Use Sentinel from the client as a lookup to see what node is the master and connect to it to do all of your reads and writes.
When you do eventually have to split reads off, your code will need to be written so that you establish a connection for each read slave and only send reads to it. You'll need to detect failovers to redistribute your read/write split.
Now, if you have a separate read-only process, you can either let it query Sentinel for a slave in place of the master, or you can set up a non-promotable slave to use for that process, though if it goes down you'd lose your access.
You Ain't Gonna Need It (YAGNI) is a good principle to follow here, as is avoiding premature optimization. A single Redis instance can be incredibly fast and doesn't suffer the same performance drops due to highly complex queries that you find in traditional SQL datastores. So I would recommend that, absent data showing otherwise, you run with a standard setup where you simply query Sentinel for the current master and use the one it returns.
Sentinel can do that:
You can also create Redis client connections from a Sentinel instance. You can connect to either the master (for write operations) or a slave (for read-only operations).
>>> master = sentinel.master_for('mymaster', socket_timeout=0.1)
>>> slave = sentinel.slave_for('mymaster', socket_timeout=0.1)
>>> master.set('foo', 'bar')
>>> slave.get('foo')
'bar'
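For completeness, the sentinel object used above is created from a list of Sentinel hosts; a minimal sketch (the host and port are placeholders):

from redis.sentinel import Sentinel

# Point the client at one or more Sentinel processes, not at Redis itself.
sentinel = Sentinel([('localhost', 26379)], socket_timeout=0.1)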
Read more: Redis Python client (redis-py) documentation.
I am using Redis as well. In my case I have a wrapper class like:
class RedisConnection:
    def __init__(self, master_host, slave_host):
        self.__master = redis.Redis(...)
        self.__slave = redis.Redis(...)

    def write_xxx(self, value):
        self.__master.set(...)

    def read_xxx(self):
        return self.__slave.get(...)
I need to create a connection pool to the database which can be reused by requests in Flask. The documentation (0.11.x) suggests using g, the application context, to store the database connections.
The issue is that the application context is created and destroyed before and after each request. Thus, there is no limit on the number of connections being created and no connection is getting reused. The code I am using is:
def get_some_connection():
    if not hasattr(g, 'some_connection'):
        logger.info('creating connection')
        g.some_connection = SomeConnection()
    return g.some_connection
and to close the connection
@app.teardown_appcontext
def destroy_some_connection(error):
    logger.info('destroying some connection')
    g.some_connection.close()
Is this intentional, i.e. does Flask want to create a fresh connection every time, or is there some issue with the use of the application context in my code? Also, if it's intentional, is there a workaround to make the connection global? I see that some of the old extensions keep the connection in app['extension'] itself.
No, you'll have to have some kind of global connection pool. g lets you share state within one request (between the various templates and functions called while handling that request) without having to pass that 'global' state around, but it is not meant to be a replacement for module-global variables, which have the same lifetime as the module.
You can certainly set the database connection onto g to ensure all of your request code uses just the one connection, but you are still free to draw the connection from a (module) global pool.
I recommend you create connections per thread and pool these. You can either build this from scratch (use a threading.local object perhaps), or you can use a project like SQLAlchemy which comes with excellent connection pool implementations. This is basically what the Flask-SQLAlchemy extension does.
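A minimal sketch of that combination, reusing the question's app and SomeConnection placeholders: a fixed-size, module-global pool that each request draws from via g:

import queue

from flask import g

POOL_SIZE = 5
_pool = queue.Queue(maxsize=POOL_SIZE)   # module-global, outlives any request
for _ in range(POOL_SIZE):
    _pool.put(SomeConnection())          # pre-fill; could also be filled lazily

def get_some_connection():
    # One connection per request, stored on g, but drawn from the global pool.
    if not hasattr(g, 'some_connection'):
        g.some_connection = _pool.get()  # blocks if all connections are checked out
    return g.some_connection

@app.teardown_appcontext
def return_some_connection(error):
    conn = getattr(g, 'some_connection', None)
    if conn is not None:
        _pool.put(conn)                  # return it to the pool instead of closing it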
I have a Model class which is part of my self-crafted ORM. It has all kind of methods like save(), create() and so on. Now, the thing is that all these methods require a connection object to act properly. And I have no clue on what's the best approach to feed a Model object with a connection object.
What I thought of so far:
provide a connection object in a Model's __init__(); this will work, by setting an instance variable and using it throughout the methods, but it kind of breaks the API; users shouldn't always have to feed a connection object when they create a Model object;
create the connection object separately, store it somewhere (where?), and in Model's __init__() get the connection from wherever it has been stored and put it in an instance variable (this is what I thought to be the best approach, but I have no idea of the best spot to store that connection object);
create a connection pool which will be fed with the connection object, then on Model's __init__() fetch the connection from the connection pool (how do I know which connection to fetch from the pool?).
If there are any other approaches, please do tell. Also, I would like to know which is the proper way to do this.
Here's how I would do it:
Use a connection pool with a queue interface. You don't have to choose a connection object; you just pick the next one in line. This can be done whenever you need a transaction, and the connection can be put back afterwards.
Unless you have some very specific needs, I would use a Singleton class for the database connection. No need to pass parameters on the constructor every time.
For testing, you just put a mocked database connection on the Singleton class.
Edit:
About the connection pool questions (I could be wrong here, but it would be my first try):
Keep all connections open. Pop one when you need it, put it back when you don't need it anymore, just like a regular queue. This queue could be exposed from the Singleton.
You start with a fixed, default number of connections (like 20). You could override the pop method so that when the queue is empty you either block (wait for another connection to be freed, if the program is multi-threaded) or create a new connection on the fly.
Destroying connections is more subtle. You need to keep track of how many connections the program is using and how likely it is that you have too many. Take care, because destroying a connection that will be needed later slows the program down. In the end, it's a heuristic problem that changes the performance characteristics.
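A minimal sketch of that scheme, assuming a hypothetical make_connection() factory and a fixed starting size:

import queue

class ConnectionPool:
    _instance = None

    def __new__(cls, size=20):
        # Singleton: the whole program shares one pool; size only matters on first use.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._pool = queue.Queue()
            for _ in range(size):
                cls._instance._pool.put(make_connection())  # hypothetical factory
        return cls._instance

    def pop(self):
        # Blocks until a connection is free, as described above.
        return self._pool.get()

    def put(self, conn):
        self._pool.put(conn)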
I'm running a Flask-based web app that uses MongoDB (with PyMongo for use in Python). Nearly every view accesses the database, so I want to make the most effective use of memory and CPU resources. I'm unsure what the most efficient method is for instantiating pymongo's Connection() object, which is used to access and manipulate the database. Right now, I declare from pymongo import Connection at the top of my file, and then at the beginning of each view function I have:
def sampleViewFunction():
    myCollection = Connection()['myDB']['myCollection']
    ## then use myCollection to manipulate the database
    ## more code...
The other way I could do it is to declare at the top of my file:
from pymongo import Connection
myCollection = Connection()['myDB']['myCollection']
And then later on, your code would just read:
def sampleViewFunction():
    ## no declaration of myCollection since it's a global variable
    ## then use myCollection to manipulate the database
    ## more code...
So the only difference is the declaration scope of myCollection. How do these two methods differ in memory handling and CPU consumption? Since this is a web application, I'm thinking about scenarios where multiple users are on the site simultaneously. I imagine there's a difference in the lifespan of the connection to the database, which I'm guessing could impact performance.
You should use the second method. When you create a connection in pymongo you by default create a connection pool. For more details, see the documentation here. This is the correct way of doing things. The default max_pool_size is 10, so this will give you 10 connections to your mongod instance(s). If you did it the other way and created a pool per function call, you would:
Be creating and destroying a connection with each function call which is wasteful of resources - both RAM and CPU.
Have no control over how many connections your code is going to create to the mongod - you could flood the mongod with connections
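A minimal sketch of the second method (note: in modern PyMongo the Connection class has been replaced by MongoClient, which pools connections the same way; the pool-size argument and the query below are illustrative):

# db.py: one shared, pooled client for the whole application
from pymongo import MongoClient

client = MongoClient(maxPoolSize=10)           # thread-safe, pooled client
myCollection = client['myDB']['myCollection']

def sampleViewFunction():
    ## reuse the shared collection handle; the driver manages the sockets
    return myCollection.find_one({'key': 'value'})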
I have developed some custom DAO-like classes to meet some very specialized requirements for my project, a server-side process that does not run inside any kind of framework.
The solution works great except that every time a new request is made, I open a new connection via MySQLdb.connect.
What is the best "drop in" solution to switch this over to using connection pooling in python? I am imagining something like the commons DBCP solution for Java.
The process is long running and has many threads that need to make requests, but not all at the same time... specifically they do quite a lot of work before brief bursts of writing out a chunk of their results.
Edited to add:
After some more searching I found anitpool.py which looks decent, but as I'm relatively new to python I guess I just want to make sure I'm not missing a more obvious/more idiomatic/better solution.
In MySQL?
I'd say don't bother with connection pooling. It's often a source of trouble, and with MySQL it's not going to bring you the performance advantage you're hoping for. This road may be a lot of effort to follow, politically, because there's so much best-practices hand-waving and textbook verbiage in this space about the advantages of connection pooling.
Connection pools are simply a bridge between the post-web era of stateless applications (e.g. the HTTP protocol) and the pre-web era of stateful, long-lived batch-processing applications. Connections were very expensive in pre-web databases (because no one cared much about how long a connection took to establish), so post-web applications devised the connection pool scheme so that every hit didn't incur this huge processing overhead on the RDBMS.
Since MySQL is more of a web-era RDBMS, connections are extremely lightweight and fast. I have written many high volume web applications that don't use a connection pool at all for MySQL.
This is a complication you may benefit from doing without, so long as there isn't a political obstacle to overcome.
IMO, the "more obvious/more idiomatic/better solution" is to use an existing ORM rather than invent DAO-like classes.
It appears to me that ORMs are more popular than "raw" SQL connections. Why? Because Python is OO, and the mapping from a SQL row to an object is absolutely essential. There aren't many use cases where you deal with SQL rows that don't map to Python objects.
I think that SQLAlchemy or SQLObject (and the associated connection pooling) are the more idiomatic Pythonic solutions.
Pooling as a separate feature isn't very common because pure SQL (without object mapping) isn't very popular for the kind of complex, long-running processes that benefit from connection pooling. Yes, pure SQL is used, but it's always used in simpler or more controlled applications where pooling isn't helpful.
I think you might have two alternatives:
Revise your classes to use SQLAlchemy or SQLObject. While this appears painful at first (all that work wasted), you should be able to leverage all the design and thought. It's merely an exercise in adopting a widely-used ORM and pooling solution.
Roll out your own simple connection pool using the algorithm you outlined: a simple set or list of connections that you cycle through (a sketch follows after this list).
Wrap your connection class.
Set a limit on how many connections you make.
Return an unused connection.
Intercept close to free the connection.
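A minimal sketch of that wrapper approach, assuming MySQLdb as in the question (ConnectionPool and PooledConnection are illustrative names):

import queue
import MySQLdb

class ConnectionPool:
    def __init__(self, limit=5, **connect_kwargs):
        self._free = queue.Queue()
        for _ in range(limit):                            # fixed limit on connections
            self._free.put(MySQLdb.connect(**connect_kwargs))

    def get(self):
        return PooledConnection(self._free.get(), self)   # blocks when exhausted

    def _release(self, raw_conn):
        self._free.put(raw_conn)

class PooledConnection:
    """Wraps a raw connection; close() returns it to the pool instead of closing it."""
    def __init__(self, raw, pool):
        self._raw, self._pool = raw, pool

    def close(self):
        self._pool._release(self._raw)                    # intercept close

    def __getattr__(self, name):
        return getattr(self._raw, name)                   # delegate everything else

Drawing from the queue gives you the limit for free: get() simply blocks until another thread returns a connection.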
Update:
I put something like this in dbpool.py:
import sqlalchemy.pool as pool
import MySQLdb as mysql
mysql = pool.manage(mysql)
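Code that used to call MySQLdb.connect() directly can then go through the managed module unchanged (the connection arguments below are placeholders; also note that this pool.manage() DBAPI-proxying interface was deprecated and later removed from SQLAlchemy, so it only applies to older versions):

# elsewhere, using the managed module from dbpool.py exactly like MySQLdb itself
conn = mysql.connect(host='localhost', user='me', passwd='secret', db='mydb')
cursor = conn.cursor()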
Old thread, but for general-purpose pooling (connections or any expensive object), I use something like:
import contextlib
import multiprocessing

def pool(ctor, limit=None):
    local_pool = multiprocessing.Queue()
    n = multiprocessing.Value('i', 0)

    @contextlib.contextmanager
    def pooled(ctor=ctor, lpool=local_pool, n=n):
        # block iff at limit
        try:
            i = lpool.get(limit and n.value >= limit)
        except multiprocessing.queues.Empty:
            n.value += 1
            i = ctor()
        yield i
        lpool.put(i)

    return pooled
This constructs connections lazily, has an optional limit, and should generalize to any use case I can think of. Of course, it assumes that you really need pooling for whatever resource it is, which you may not with many modern SQL-likes. Usage:
# in main:
my_pool = pool(lambda: do_something())

# in thread:
with my_pool() as my_obj:
    my_obj.do_something()
This does assume that whatever object ctor creates has an appropriate destructor if needed (some servers don't kill connection objects unless they are closed explicitly).
I've just been looking for the same sort of thing.
I've found PySQLPool and the SQLAlchemy pool module.
Replying to an old thread, but the last time I checked, MySQL offers connection pooling as part of its drivers.
You can check them out at :
https://dev.mysql.com/doc/connector-python/en/connector-python-connection-pooling.html
From TFA, assuming you want to open a connection pool explicitly (as the OP had stated):
import mysql.connector.pooling

dbconfig = {"database": "test", "user": "joe"}
cnxpool = mysql.connector.pooling.MySQLConnectionPool(pool_name="mypool",
                                                      pool_size=3,
                                                      **dbconfig)
Connections are then obtained from the pool through its get_connection() method:
cnx1 = cnxpool.get_connection()
cnx2 = cnxpool.get_connection()
Making your own connection pool is a BAD idea if your app ever decides to start using multi-threading. Making a connection pool for a multi-threaded application is much more complicated than one for a single-threaded application. You can use something like PySQLPool in that case.
It's also a BAD idea to use an ORM if you're looking for performance.
If you'll be dealing with huge/heavy databases that have to handle lots of selects, inserts,
updates and deletes at the same time, then you're going to need performance, which means you'll need custom SQL written to optimize lookups and lock times. With an ORM you don't usually have that flexibility.
So basically, yeah, you can make your own connection pool and use ORMs but only if you're sure you won't need anything of what I just described.
Use DBUtils, simple and reliable.
pip install DBUtils
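A minimal sketch with DBUtils' PooledDB (the import path is for DBUtils 2.x; pymysql is just one possible DB-API driver, and the connection arguments are placeholders):

import pymysql
from dbutils.pooled_db import PooledDB

# A shared pool of up to 5 connections created with the given driver and arguments.
pool = PooledDB(creator=pymysql, maxconnections=5,
                host='localhost', user='me', password='secret', database='mydb')

conn = pool.connection()      # borrow a connection from the pool
cursor = conn.cursor()
cursor.execute('SELECT 1')
cursor.close()
conn.close()                  # returns the connection to the pool rather than closing it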
I did something similar for OpenSearch, so you can refer to it.
from opensearchpy import OpenSearch

def get_connection():
    connection = None
    try:
        connection = OpenSearch(
            hosts=[{'host': settings.OPEN_SEARCH_HOST, 'port': settings.OPEN_SEARCH_PORT}],
            http_compress=True,
            http_auth=(settings.OPEN_SEARCH_USER, settings.OPEN_SEARCH_PASSWORD),
            use_ssl=True,
            verify_certs=True,
            ssl_assert_hostname=False,
            ssl_show_warn=False,
        )
    except Exception as error:
        print("Error: Connection not established {}".format(error))
    else:
        print("Connection established")
    return connection
class OpenSearchClient(object):
    connection_pool = []
    connection_in_use = []

    def __init__(self):
        if not OpenSearchClient.connection_pool:
            OpenSearchClient.connection_pool = [
                get_connection() for i in range(0, settings.CONNECTION_POOL_SIZE)
            ]

    def search_data(self, query="", index_name=settings.OPEN_SEARCH_INDEX):
        available_cursor = OpenSearchClient.connection_pool.pop(0)
        OpenSearchClient.connection_in_use.append(available_cursor)
        response = available_cursor.search(body=query, index=index_name)
        # Return the client to the pool without closing it so it can be reused.
        OpenSearchClient.connection_pool.append(available_cursor)
        OpenSearchClient.connection_in_use.pop(-1)
        return response