Connection Pooling in Mongo with PyMongo - python

I would like to set up something that keeps 3-4 MongoDB connections alive in memory, acting as a pool of connections. I don't want to create a new connection every time one of my core functions runs a db query. The idea is that if I have a pool of connections, my core functions can take an available connection from the pool, use it, and release it back to the pool.
Does it make any sense? Is it possible to achieve this?
I know that MongoDB internally does have connection pooling, but I would like to do the above on top of it.

Your question doesn't really make sense. There's no need to build connection pooling on top of the existing connection pooling.
From the PyMongo docs: all that really matters is whether your application uses multithreading or multiprocessing.
Multithreading is already handled by the connection pool, and you only need one client instance.
Multiprocessing requires you to create a new client instance for each process to avoid deadlock issues.
Your app is likely multithreaded, so all you should need to do is have a global instance of your db connection and reuse it for each db query. PyMongo will take care of the rest.
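For illustration, a minimal sketch of that pattern (the URI, database, and collection names here are placeholders):
from pymongo import MongoClient

# One global client for the whole process; MongoClient is thread-safe
# and maintains its own connection pool internally.
client = MongoClient("mongodb://localhost:27017", maxPoolSize=50)
db = client["mydb"]

def run_query():
    # Each operation checks a socket out of the client's pool and
    # returns it automatically; no manual acquire/release is needed.
    return db["mycollection"].find_one({"status": "active"})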

Related

Reusing database connection for multiple requests

If I don't need transactions, can I reuse the same database connection for multiple requests?
Flask documentation says:
Because database connections encapsulate a transaction, we also need to make sure that only one request at the time uses the connection.
Here's how I understand the meaning of the above sentence:
Python DB-API connection can only handle one transaction at a time; to start a new transaction, one must first commit or roll back the previous one. So if each of our requests needs its own transaction, then of course each request needs its own database connection.
Please let me know if I got it wrong.
But let's say I set autocommit mode, and handle each request in a single SQL statement. Or, alternatively, let's say I only read - not write - to the database. In either case, it seems I can just reuse the same database connection for all my requests to save the overhead of multiple connections. But I'm not sure if there's any downside to this approach.
Edit: I can see one issue with what I'm proposing: each request might be handled by a different process. Since connections should probably not be reused across processes, let me clarify my question: I mean creating one connection per process, and using it for all requests that happen to be handled by this process.
On the other hand, the whole point of (green or native) threads is usually to serve one request per thread, so my proposed approach implies sharing connection across threads. It seems one connection can be used concurrently in multiple native threads, but not in multiple green threads.
So let's say for concreteness my environment is flask + gunicorn with multiple multi-threaded sync workers.
Based on @Craig Ringer's comment on a different question, I think I know the answer.
The only possible advantage of connection sharing is performance (other factors - like transaction encapsulation and simplicity - favor a separate connection per request). And since a connection can't be shared across processes or green threads, it only has a chance with native threads. But psycopg2 (and presumably other drivers) doesn't allow concurrent access from the same connection. So unless each request spends very little time talking to the database, there is likely a performance hit, not benefit, from connection sharing.
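For completeness, a hedged sketch of the safer alternative (one connection per native thread rather than a shared one), assuming psycopg2 and a hypothetical DSN:
import threading

import psycopg2

_local = threading.local()

def get_conn():
    # Lazily open one connection per thread; psycopg2 does not allow
    # concurrent queries on a single connection, so per-thread
    # connections sidestep the serialization problem entirely.
    if not hasattr(_local, "conn"):
        _local.conn = psycopg2.connect("dbname=appdb")  # hypothetical DSN
    return _local.conn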

How to manage db connection especially in case of multi threading

I am working on an online judge. I am using Python 2.7 and MySQL (as I am working on the back-end part).
My Method:
I create a main thread which pulls submissions from the database (10 at a time) and puts them in a queue. Then I have multiple worker threads that take submissions from the queue, evaluate them, and write the results back to the database.
Now I have some doubts (I know they span different topics, but guidance on any of them is highly appreciated).
Currently, when I start the threads, I give each its own db connection, which it uses. Is one connection per thread good practice? Does sharing connections between threads create problems? How do I go about this?
My main thread uses a single connection, as its only work is to pull submissions from the db and put them in the queue (and also update their status in the db to Assessing Submission). But sometimes I get the error: Lost connection to MySQL server while querying. I keep getting it even when I stop the program and start it again. What do I do about it? Also, should I implement a pool of connections just for the main thread?
Also, does a db connection stay alive forever? What should I do when its session memory etc. gets exhausted?
Use a connection pool. Sharing the database connection is not always bad but you have to be careful about it. You can try SQLAlchemy to manage a lot of this for you: http://docs.sqlalchemy.org/en/rel_0_8/orm/session.html#unitofwork-contextual
The server might be out of connections, or your connection might have been killed because it used too many resources, etc. A connection pool could help you solve this.
It all depends; theoretically a connection could stay alive indefinitely, but usually there is a timeout somewhere.
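As a hedged sketch of the SQLAlchemy suggestion above (the connection URL and table are placeholders), a single engine with a pool and a recycle interval addresses all three doubts:
from sqlalchemy import create_engine, text

engine = create_engine(
    "mysql+pymysql://user:pass@localhost/judge",  # hypothetical URL
    pool_size=10,       # roughly one pooled connection per worker thread
    pool_recycle=3600,  # refresh connections before the server times them out
)

def fetch_submissions():
    # Each checkout borrows a connection from the pool; the context
    # manager returns it to the pool when the block exits.
    with engine.connect() as conn:
        return conn.execute(text("SELECT id FROM submissions LIMIT 10")).fetchall()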
If you give the same connection to every thread, the threads' queries will interfere with each other and race conditions will occur. So giving each thread its own connection is indeed a good idea. Use a connection pool for this purpose; it will hand out separate connections.
Connection Pool will surely help.
Release the connection once your work is done. Connections also have a lifetime limit, termed the connection timeout, so you need a third-party library to handle that; c3p0 is a good library that can help with this.
Please refer the below link to configure it:
Best configuration of c3p0

How to properly handle Redis connections with Python / Python RQ?

What is the best pattern to handle Redis connections (both for interacting with Redis directly and indirectly through Python-RQ)?
Generally, database connections need to be closed / returned to a pool when done, but I don't see how to do that with redis-py. That makes me wonder if I'm doing it the wrong way.
Also, I have seen some performance dips when enqueuing jobs to RQ, which I've been told might relate to poor connection usage / reuse.
Basically, I'm interested in knowing the correct pattern, so I can either verify or correct what we have in our application.
Thanks a lot! If there's more information that would be helpful, let me know.
Behind the scenes, redis-py uses a connection pool to manage connections to a Redis server. By default, each Redis instance you create will in turn create its own connection pool. You can override this behavior and use an existing connection pool by passing an already created connection pool instance to the connection_pool argument of the Redis class. You may choose to do this in order to implement client-side sharding or to have finer-grained control of how connections are managed:
import redis

# Create the pool explicitly once and hand it to each Redis instance.
pool = redis.ConnectionPool(host='localhost', port=6379, db=0)
r = redis.Redis(connection_pool=pool)
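For RQ specifically, one hedged pattern is to pass that shared client when constructing queues, so enqueues reuse pooled sockets instead of reconnecting each time:
from rq import Queue

# Reuse the pooled client `r` from the snippet above.
q = Queue(connection=r)    # RQ accepts an existing Redis client
q.enqueue(print, "hello")  # trivial job, just for illustration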

Mysql connection pooling question: is it worth it?

I recall hearing that the connection process in MySQL was designed to be very fast compared to other RDBMSes, and that therefore a library that provides connection pooling (SQLAlchemy) won't actually help you much if you enable it.
Does anyone have any experience with this?
I'm leery of enabling it because of the possibility that if some code does something stateful to a db connection and (perhaps mistakenly) doesn't clean up after itself, that state which would normally get cleaned up upon closing the connection will instead get propagated to subsequent code that gets a recycled connection.
There's no need to worry about residual state on a connection when using SQLA's connection pool, unless your application is changing connection-wide options like transaction isolation levels (which generally is not the case). SQLA's connection pool issues a connection.rollback() on the connection when it's checked back in, so that any transactional state or locks are cleared.
MySQL's connection time is indeed pretty fast, especially if you're connecting over unix sockets on the same machine. If you do use a connection pool, you also want to ensure that connections are recycled after some period of time, since the MySQL server automatically closes connections that have been idle for more than 8 hours by default (in SQLAlchemy this is the pool_recycle option).
You can quickly benchmark connection pool vs. no pool in a SQLA application by changing the pool implementation from the default QueuePool to NullPool, a pool implementation that doesn't actually pool anything: it connects and disconnects for real when the proxied connection is acquired and later closed.
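A hedged sketch of that A/B setup (the connection URL is a placeholder):
from sqlalchemy import create_engine
from sqlalchemy.pool import NullPool, QueuePool

# Same workload, two engines: time your queries against each.
pooled = create_engine(
    "mysql+pymysql://user:pass@localhost/test",  # hypothetical URL
    poolclass=QueuePool, pool_recycle=3600,
)
unpooled = create_engine(
    "mysql+pymysql://user:pass@localhost/test",
    poolclass=NullPool,  # a real connect/disconnect per checkout
)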
Even if the connection part of MySQL itself is pretty slick, presumably there's still a network connection involved (whether that's loopback or physical). If you're making a lot of requests, that could get significantly expensive. It will depend (as is so often the case) on exactly what your application does, of course - if you're doing a lot of work per connection, then that will dominate and you won't gain a lot.
When in doubt, benchmark - but I would by-and-large trust that a connection pooling library (at least, a reputable one) should work properly and reset things appropriately.
Short answer: you need to benchmark it.
Long answer: it depends. MySQL is fast at connection setup, so avoiding that cost is not a good reason to go for connection pooling. Where pooling wins is when the queries themselves are few and fast, because then connection setup would otherwise dominate the cost of each request.
The other worry is how the application treats the SQL connection. If it does no SQL transactions, and makes no assumptions about the state of the connection, then pooling won't be a problem. On the other hand, code that relies on the connection being closed to discard temporary tables or to roll back transactions will have a lot of problems with pooling.
A connection pool speeds things up by the fact that you do not have to create a java.sql.Connection object every time you do a database query. I use the Tomcat connection pool against a MySQL database for web applications that do a lot of queries; during high user load there is a noticeable speed improvement.
I made a simple RESTful service with Django and tested it with and without connection pooling. In my case, the difference was quite noticeable.
In a LAN, without it, response time was between 1 and 5 seconds. With it, less than 20 ms.
Results may vary, but the configuration I'm using for the MySQL & Apache servers is pretty standard low-end.
If you're serving UI pages over the internet the extra time may not be noticeable to the user, but in my case it was unacceptable, so I opted for using the pool. Hope this helps you.

What is the best solution for database connection pooling in python?

I have developed some custom DAO-like classes to meet some very specialized requirements for my project, which is a server-side process that does not run inside any kind of framework.
The solution works great except that every time a new request is made, I open a new connection via MySQLdb.connect.
What is the best "drop in" solution to switch this over to using connection pooling in python? I am imagining something like the commons DBCP solution for Java.
The process is long running and has many threads that need to make requests, but not all at the same time... specifically they do quite a lot of work before brief bursts of writing out a chunk of their results.
Edited to add:
After some more searching I found anitpool.py, which looks decent, but as I'm relatively new to Python I just want to make sure I'm not missing a more obvious/more idiomatic/better solution.
In MySQL?
I'd say don't bother with connection pooling. It's often a source of trouble, and with MySQL it's not going to bring you the performance advantage you're hoping for. This road may be a lot of effort to follow, politically, because there's so much best-practices hand-waving and textbook verbiage in this space about the advantages of connection pooling.
Connection pools are simply a bridge between the post-web era of stateless applications (e.g. the HTTP protocol) and the pre-web era of stateful, long-lived batch-processing applications. Because connections were very expensive in pre-web databases (no one used to care much about how long a connection took to establish), post-web applications devised this connection-pool scheme so that every hit didn't incur this huge processing overhead on the RDBMS.
Since MySQL is more of a web-era RDBMS, connections are extremely lightweight and fast. I have written many high volume web applications that don't use a connection pool at all for MySQL.
This is a complication you may benefit from doing without, so long as there isn't a political obstacle to overcome.
IMO, the "more obvious/more idiomatic/better solution" is to use an existing ORM rather than invent DAO-like classes.
It appears to me that ORM's are more popular than "raw" SQL connections. Why? Because Python is OO, and the mapping from a SQL row to an object is absolutely essential. There aren't many use cases where you deal with SQL rows that don't map to Python objects.
I think that SQLAlchemy or SQLObject (and the associated connection pooling) are the more idiomatic Pythonic solutions.
Pooling as a separate feature isn't very common because pure SQL (without object mapping) isn't very popular for the kind of complex, long-running processes that benefit from connection pooling. Yes, pure SQL is used, but it's always used in simpler or more controlled applications where pooling isn't helpful.
I think you might have two alternatives:
Revise your classes to use SQLAlchemy or SQLObject. While this appears painful at first (all that work wasted), you should be able to leverage all the design and thought. It's merely an exercise in adopting a widely-used ORM and pooling solution.
Roll your own simple connection pool using the algorithm you outlined: a simple set or list of connections that you cycle through (see the sketch after this list).
Wrap your connection class.
Set a limit on how many connections you make.
Return an unused connection.
Intercept close to free the connection.
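For illustration, a minimal sketch of that algorithm (the class and names are hypothetical, and a production version would need locking and error handling):
import queue

class SimplePool:
    # Wrap the connection factory, cap the total, reuse idle
    # connections, and intercept "close" via release().
    def __init__(self, connect, limit=5):
        self._connect = connect
        self._idle = queue.Queue(maxsize=limit)
        self._created = 0
        self._limit = limit

    def acquire(self):
        try:
            return self._idle.get_nowait()   # reuse an unused connection
        except queue.Empty:
            if self._created < self._limit:  # enforce the limit
                self._created += 1
                return self._connect()
            return self._idle.get()          # block until one is released

    def release(self, conn):
        # Instead of closing, put the connection back for reuse.
        self._idle.put(conn)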
Update:
I put something like this in dbpool.py:
import sqlalchemy.pool as pool
import MySQLdb as mysql

# pool.manage() wraps the DB-API module so that subsequent
# mysql.connect() calls are transparently served from a managed pool.
mysql = pool.manage(mysql)
Old thread, but for general-purpose pooling (connections or any expensive object), I use something like:
import contextlib
import multiprocessing
from queue import Empty  # multiprocessing.Queue raises queue.Empty

def pool(ctor, limit=None):
    local_pool = multiprocessing.Queue()
    n = multiprocessing.Value('i', 0)

    @contextlib.contextmanager
    def pooled(ctor=ctor, lpool=local_pool, n=n):
        # Block on the queue only when at the limit; otherwise a
        # non-blocking get falls through to constructing a new object.
        try:
            i = lpool.get(limit and n.value >= limit)
        except Empty:
            n.value += 1
            i = ctor()
        yield i
        lpool.put(i)

    return pooled
This constructs objects lazily, has an optional limit, and should generalize to any use case I can think of. Of course, it assumes that you really need pooling of whatever resource at all, which you may not for many modern SQL-likes. Usage:
# in main:
my_pool = pool(lambda: do_something())
# in thread:
with my_pool() as my_obj:
    my_obj.do_something()
This does assume that whatever object ctor creates has an appropriate destructor if needed (some servers don't kill connection objects unless they are closed explicitly).
I've just been looking for the same sort of thing.
I've found pysqlpool and the sqlalchemy pool module
Replying to an old thread, but the last time I checked, MySQL offers connection pooling as part of its Connector/Python driver.
You can check them out at :
https://dev.mysql.com/doc/connector-python/en/connector-python-connection-pooling.html
From TFA, assuming you want to open a connection pool explicitly (as the OP had stated):
import mysql.connector.pooling

dbconfig = {"database": "test", "user": "joe"}
cnxpool = mysql.connector.pooling.MySQLConnectionPool(
    pool_name="mypool",
    pool_size=3,
    **dbconfig
)
Connections are then obtained from the pool through its get_connection() method:
cnx1 = cnxpool.get_connection()
cnx2 = cnxpool.get_connection()
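One caveat worth adding from the same docs: calling close() on a pooled connection returns it to the pool rather than closing the underlying socket, so the usual acquire/close pattern still applies:
cnx1 = cnxpool.get_connection()
try:
    cur = cnx1.cursor()
    cur.execute("SELECT 1")
    print(cur.fetchone())
    cur.close()
finally:
    cnx1.close()  # returned to the pool, not actually closed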
Making your own connection pool is a BAD idea if your app ever decides to start using multi-threading. Making a connection pool for a multi-threaded application is much more complicated than one for a single-threaded application. You can use something like PySQLPool in that case.
It's also a BAD idea to use an ORM if you're looking for performance.
If you'll be dealing with huge/heavy databases that have to handle lots of selects, inserts, updates, and deletes at the same time, then you're going to need performance, which means you'll need custom SQL written to optimize lookups and lock times. With an ORM you don't usually have that flexibility.
So basically, yeah, you can make your own connection pool and use ORMs, but only if you're sure you won't need anything of what I just described.
Use DBUtils, simple and reliable.
pip install DBUtils
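A hedged sketch of what that looks like (assuming pymysql as the underlying driver; credentials are placeholders):
from dbutils.pooled_db import PooledDB
import pymysql

# The pool creates connections via the pymysql module on demand,
# up to maxconnections, and reuses them afterwards.
db_pool = PooledDB(creator=pymysql, maxconnections=10,
                   host="localhost", user="user",
                   password="pass", database="test")

conn = db_pool.connection()  # borrow a connection from the pool
try:
    cur = conn.cursor()
    cur.execute("SELECT 1")
    print(cur.fetchone())
    cur.close()
finally:
    conn.close()  # returns the connection to the pool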
I did something similar for OpenSearch, so you can refer to it:
from opensearchpy import OpenSearch

# `settings` below is the application's own config module (not shown here).
def get_connection():
    connection = None
    try:
        connection = OpenSearch(
            hosts=[{'host': settings.OPEN_SEARCH_HOST, 'port': settings.OPEN_SEARCH_PORT}],
            http_compress=True,
            http_auth=(settings.OPEN_SEARCH_USER, settings.OPEN_SEARCH_PASSWORD),
            use_ssl=True,
            verify_certs=True,
            ssl_assert_hostname=False,
            ssl_show_warn=False,
        )
    except Exception as error:
        print("Error: Connection not established {}".format(error))
    else:
        print("Connection established")
    return connection

class OpenSearchClient(object):
    # Class-level lists act as a very simple pool. Note this is not
    # thread-safe; a production pool would use a lock or a queue.
    connection_pool = []
    connection_in_use = []

    def __init__(self):
        if not OpenSearchClient.connection_pool:
            OpenSearchClient.connection_pool = [
                get_connection() for _ in range(settings.CONNECTION_POOL_SIZE)
            ]

    def search_data(self, query="", index_name=settings.OPEN_SEARCH_INDEX):
        # Borrow the oldest idle connection, run the query, then return
        # it to the pool. The connection must not be closed here,
        # otherwise it could not be reused on the next call.
        available_cursor = OpenSearchClient.connection_pool.pop(0)
        OpenSearchClient.connection_in_use.append(available_cursor)
        response = available_cursor.search(body=query, index=index_name)
        OpenSearchClient.connection_pool.append(available_cursor)
        OpenSearchClient.connection_in_use.remove(available_cursor)
        return response
