I'm running a Flask-based web app that uses MongoDB (with PyMongo in Python). Nearly every view accesses the database, so I want to make the most effective use of memory and CPU resources. I'm unsure what the most efficient method is for instantiating pymongo's Connection() object, which is used to access and manipulate the database. Right now, I declare from pymongo import Connection at the top of my file, and then at the beginning of each view function I have:
def sampleViewFunction():
    myCollection = Connection()['myDB']['myCollection']
    ## then use myCollection to manipulate the database
    ## more code...
The other way I could do it is to declare, at the top of my file:
from pymongo import Connection
myCollection = Connection()['myDB']['myCollection']
And then, later on, the code would just read:
def sampleViewFunction():
    ## no declaration of myCollection since it's a global variable
    ## then use myCollection to manipulate the database
    ## more code...
So the only difference is the declaration scope of myCollection. How do these two methods differ in memory handling and CPU consumption? Since this is a web application, I'm thinking about scenarios where multiple users are on the site simultaneously. I imagine there's a difference in the lifespan of the connection to the database, which I'm guessing could impact performance.
You should use the second method. When you create a connection in pymongo, you create a connection pool by default; see the connection pooling documentation for more details. This is the correct way of doing things. The default max_pool_size is 10, so this will give you up to 10 connections to your mongod instance(s). If you did it the other way and created a pool per function call, you would:
Be creating and destroying a connection with each function call, which is wasteful of resources - both RAM and CPU.
Have no control over how many connections your code is going to create to the mongod - you could flood the mongod with connections.
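For illustration, here is a minimal sketch of the recommended module-level setup. Note that in current PyMongo releases Connection has been replaced by MongoClient (also pooled and thread-safe); the route, database and collection names are placeholders, not taken from the question.

# myapp.py -- a sketch, not the asker's actual code
from flask import Flask
from pymongo import MongoClient  # Connection in older PyMongo releases

app = Flask(__name__)

# Created once at import time; the client manages its own connection pool
# and can safely be shared across requests and threads.
myCollection = MongoClient()['myDB']['myCollection']

@app.route('/sample')
def sampleViewFunction():
    # Reuse the shared, pooled client instead of building a new one per request.
    doc = myCollection.find_one()
    return str(doc or {})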
Related
I am using sqlite3 in a Flask app (actually Connexion).
I would like to stay in-memory but keep the db between requests to the server.
So it should be destroyed only after the server is killed.
When I use sqlite3.connect(':memory:'), the db is destroyed after each response.
So I followed the approach from In memory SQLite3 shared database python and ran sqlite3.connect('file::memory:?cache=shared&mode=memory', uri=True). But then a file called file::memory:?cache=shared&mode=memory appears in the app root and does not disappear when I kill the server. When I start the server again, the db-init routine which creates the tables fails, because the tables are already created.
I tried this out on Linux and Mac. Both have the same behaviour. It seems like the db is saved to a file instead of being mapped to memory.
My Python version is 3.9 and sqlite3.sqlite_version_info is (3, 37, 0).
I suspect that sqlite is treating 'file::memory:?cache=shared&mode=memory' as a file name. Therefore, on execution, it creates a database file with that "name" in its root directory.
Now, to the issue: I would try connecting via:
sqlite3.connect(':memory:')
and to keep it alive you could try opening a connection before starting to serve the app, storing the connection object somewhere so it doesn't get garbage collected, and proceeding as usual, opening and closing other connections to it (on a per-request basis).
Note: keep in mind I have only tested this in a single-threaded script, to check whether a new sqlite3.connect(':memory:') connects to the same database that we have already loaded (it does).
I do not know how well it would play with Flask's threads, or with sqlite itself.
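A minimal sketch of this keeper-connection idea, using SQLite's named shared-cache URI form so that separate connections in the same process can see the same in-memory database (the name shared_memdb and the entries table are just illustrations, not from the answer above):

import sqlite3

# Illustrative URI; any name works as long as every connection uses the same one.
DB_URI = "file:shared_memdb?mode=memory&cache=shared"

# Keeper connection, opened before the app starts serving. As long as it stays
# open (and referenced), the in-memory database is not destroyed.
keeper = sqlite3.connect(DB_URI, uri=True)
keeper.execute("CREATE TABLE IF NOT EXISTS entries (entry TEXT)")

def handle_request(value):
    # Per-request connection to the same shared in-memory database.
    con = sqlite3.connect(DB_URI, uri=True)
    try:
        with con:  # commits on success, rolls back on error
            con.execute("INSERT INTO entries (entry) VALUES (?)", (value,))
    finally:
        con.close()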
UPDATE:
Here's my approach, more info below:
import sqlite3

class db_test:
    # DOES NOT INCLUDE LOADING THE FILE TO MEMORY AND VICE VERSA (out of the scope of the question)
    def __init__(self):
        self.db = sqlite3.connect(":memory:", check_same_thread=False)

    def execute_insert(self, query: str, data: tuple):
        cur = self.db.cursor()
        with self.db:
            cur.execute(query, data)
        cur.close()
The above class is instantiated once at the beginning of my Flask app, right after the imports, like so:
from classes import db_test
db = db_test()
This keeps the connection object referenced, so it is not garbage collected.
To use it, simply call it wherever needed, like so:
@app.route("/db_test")
def db_test():
    db.execute_insert("INSERT INTO my_table (entry) VALUES (?)", ('hello', ))
    return render_template("db_test.html")
Notes:
You might have noticed the 2nd argument in self.db = sqlite3.connect(":memory:", check_same_thread=False). This makes it possible to use connections and cursors created in different threads (as Flask does), but at the risk of collisions and corrupted data/entries (a lock-based mitigation is sketched after these notes).
From my understanding (regarding my setup: Flask -> Waitress -> nginx), unless explicitly set to some multithreaded/multiprocessing mode, Flask will process each request start-to-finish and then proceed to the next, thus rendering the above danger irrelevant.
I set up a rudimentary test to see if my theory holds up: I inserted an incrementing number every time the page was requested, then spammed refresh on a PC, a laptop & a mobile. The resulting 164 entries were checked manually for integrity and passed.
Finally: keep in mind that I might be missing something, that my methodology is not a proper stress test, and that there may be differences between our setups.
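As referenced in the notes above, here is a minimal sketch (my own addition, with an illustrative class name) of guarding the shared connection with a lock, in case the app does end up handling requests from multiple threads:

import sqlite3
import threading

class db_test_locked:
    # Same idea as db_test above, but writes are serialized with a lock.
    def __init__(self):
        self.db = sqlite3.connect(":memory:", check_same_thread=False)
        self._lock = threading.Lock()

    def execute_insert(self, query: str, data: tuple):
        with self._lock, self.db:
            self.db.execute(query, data)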
Hope this helps!
PS: The first approach I suggested could not be replicated inside Flask. I suspect that is due to Flask's thread activity.
Applications often need to connect to other services (a database, a cache, an API, etc). For sanity and DRY, we'd like to keep all of these connections in one module so the rest of our code base can share connections.
To reduce boilerplate, downstream usage should be simple:
# app/do_stuff.py
from .connections import AwesomeDB
db = AwesomeDB()
def get_stuff():
    return db.get('stuff')
And setting up the connection should also be simple:
# app/cli.py or some other main entry point
from .connections import AwesomeDB
db = AwesomeDB()
db.init(username='stuff admin') # Or os.environ['DB_USER']
Web frameworks like Django and Flask do something like this, but it feels a bit clunky:
Connect to a Database in Flask, Which Approach is better?
http://flask.pocoo.org/docs/0.10/tutorial/dbcon/
One big issue with this is that we want a reference to the actual connection object instead of a proxy, because we want to retain tab-completion in IPython and other dev environments.
So what's the Right Way (tm) to do it? After a few iterations, here's my idea:
#app/connections.py
from awesome_database import AwesomeDB as RealAwesomeDB
from horrible_database import HorribleDB as RealHorribleDB
class ConnectionMixin(object):
    __connection = None

    def __new__(cls):
        cls.__connection = cls.__connection or object.__new__(cls)
        return cls.__connection

    def __init__(self, real=False, **kwargs):
        if real:
            super().__init__(**kwargs)

    def init(self, **kwargs):
        kwargs['real'] = True
        self.__init__(**kwargs)

class AwesomeDB(ConnectionMixin, RealAwesomeDB):
    pass

class HorribleDB(ConnectionMixin, RealHorribleDB):
    pass
Room for improvement: Set initial __connection to a generic ConnectionProxy instead of None, which catches all attribute access and throws an exception.
I've done quite a bit of poking around here on SO and in various OSS projects and haven't seen anything like this. It feels pretty solid, though it does mean a bunch of modules will be instantiating connection objects as a side effect at import time. Will this blow up in my face? Are there any other negative consequences to this approach?
First, design-wise, I might be missing something, but I don't see why you need the heavy mixin+singleton machinery instead of just defining a helper like so:
_awesome_db = None
def awesome_db(**overrides):
    global _awesome_db
    if _awesome_db is None:
        # Read config/set defaults.
        # overrides.setdefault(...)
        _awesome_db = RealAwesomeDB(**overrides)
    return _awesome_db
Also, there is a bug (it might not look like a supported use case, but anyway): if you make the following two calls in a row, you will wrongly get the same connection object twice, even though you passed different parameters:
db = AwesomeDB()
db.init(username='stuff admin')
db = AwesomeDB()
db.init(username='not-admin') # You'll get admin connection here.
An easy fix for that would be to use a dict of connections keyed on the input parameters.
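A rough sketch of that fix, assuming the keyword arguments are hashable:

_connections = {}

def awesome_db(**overrides):
    # One connection per distinct parameter set.
    key = tuple(sorted(overrides.items()))
    if key not in _connections:
        _connections[key] = RealAwesomeDB(**overrides)
    return _connections[key]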
Now, on the essence of the question.
I think the answer depends on how your "connection" classes are actually implemented.
Potential downsides with your approach I see are:
In a multithreaded environment you could get problems with unsynchronized concurrent access to the global connection object from multiple threads, unless it is already thread-safe. If you care about that, you could change your code and interface a bit and use a thread-local variable (see the sketch after this list).
What if a process forks after creating the connection? Web application servers tend to do that and it might not be safe, again depending on the underlying connection.
Does the connection object have state? What happens if the connection object becomes invalid (e.g. due to a connection error/timeout)? You might need to replace the broken connection with a new one to return the next time a connection is requested.
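For the multithreading point above, a minimal thread-local sketch (reusing the awesome_db() helper idea from earlier) could look like:

import threading

_local = threading.local()

def awesome_db(**overrides):
    # Lazily create one connection per thread instead of one global object.
    if getattr(_local, "db", None) is None:
        _local.db = RealAwesomeDB(**overrides)
    return _local.db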
Connection management is often already efficiently and safely implemented through a connection pool in client libraries.
For example, the redis-py Redis client uses the following implementation:
https://github.com/andymccurdy/redis-py/blob/1c2071762ad9b9288e786665990083e61c1cf355/redis/connection.py#L974
The Redis client then uses the connection pool like so:
Requests a connection from the connection pool.
Tries to execute a command on the connection.
If the connection fails, the client closes it.
In any case, it is finally returned to the connection pool so it can be reused by subsequent calls or other threads.
So since the Redis client handles all of that under the hood, you can safely do what you want directly. Connections will be lazily created until the connection pool reaches full capacity.
# app/connections.py
import redis

def redis_client(**kwargs):
    # Maybe read configuration/set default arguments
    # kwargs.setdefault()
    return redis.Redis(**kwargs)
SQLAlchemy can use connection pooling as well.
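For example (a sketch with placeholder credentials and query), a module-level Engine maintains its own connection pool and can be shared the same way:

from sqlalchemy import create_engine, text

# One engine per process; it manages a pool of DBAPI connections internally.
engine = create_engine(
    "postgresql://user:password@localhost/mydb",  # placeholder URL
    pool_size=5,
    max_overflow=10,
)

def get_stuff():
    # Borrow a pooled connection, run a query, and return it to the pool.
    with engine.connect() as conn:
        return conn.execute(text("SELECT 1")).scalar()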
To summarize, my understanding is that:
If your client library supports connection pooling, you don't need to do anything special to share connections between modules and even threads. You could just define a helper similar to redis_client() that reads configuration, or specifies default parameters.
If your client library provides only low-level connection objects, you will need to make sure access to them is thread-safe and fork-safe. You also need to make sure you return a valid connection each time (or raise an exception if you can't establish a new one or reuse an existing one).
I am using Flask + SQLAlchemy (DB is Postgres) for my server, and am wondering how connection pooling happens. I know that it is enabled by default with a pool size of 5, but I don't know if my code triggers it.
Assuming I use the default Flask-SQLAlchemy bridge:
db = SQLAlchemy(app)
And then use that object to place database calls like
db.session.query(......)
How does flask-sqlalchemy manage the connection pool behind the scene? Does it grab a new session every time I access db.session? When is this object returned to the pool (assuming I don't store it in a local variable)?
What is the correct pattern to write code to maximize concurrency + performance? If I access the DB multiple times in one serial method, is it a good idea to use db.session every time?
I was unable to find documentation on this matter, so I don't know what is happening behind the scenes (the code works, but will it scale?)
Thanks!
You can use event registration - http://docs.sqlalchemy.org/en/latest/core/event.html#event-registration
There are many different event types that can be monitored, checkout, checkin, connect etc... - http://docs.sqlalchemy.org/en/latest/core/events.html
Here is a basic example from the docs that prints a message when a new connection is established.
from sqlalchemy.event import listen
from sqlalchemy.pool import Pool

def my_on_connect(dbapi_con, connection_record):
    print("New DBAPI connection:", dbapi_con)

listen(Pool, 'connect', my_on_connect)
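For the pooling behaviour asked about, a similar sketch can listen for the checkout and checkin pool events, which fire when a connection is borrowed from and returned to the pool:

from sqlalchemy import event
from sqlalchemy.pool import Pool

@event.listens_for(Pool, "checkout")
def on_checkout(dbapi_con, connection_record, connection_proxy):
    # Fired when a pooled connection is handed out (e.g. for a db.session query).
    print("Connection checked out:", dbapi_con)

@event.listens_for(Pool, "checkin")
def on_checkin(dbapi_con, connection_record):
    # Fired when the connection is returned to the pool.
    print("Connection checked in:", dbapi_con)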
I'm running PostgreSQL 9.3 and SQLAlchemy 0.8.2 and am experiencing database connection leaks. After deploying, the app consumes around 240 connections. Over the next 30 hours this number gradually grows to 500, at which point PostgreSQL starts dropping connections.
I use SQLAlchemy thread-local sessions:
import os
from sqlalchemy import orm, create_engine

engine = create_engine(os.environ['DATABASE_URL'], echo=False)
Session = orm.scoped_session(orm.sessionmaker(engine))
For the Flask web app, the .remove() call on the Session proxy object is sent during request teardown:
@app.teardown_request
def teardown_request(exception=None):
    if not app.testing:
        Session.remove()
This should be the same as what Flask-SQLAlchemy is doing.
I also have some periodic tasks that run in a loop, and I call .remove() for every iteration of the loop:
def run_forever():
    while True:
        do_stuff(Session)
        Session.remove()
What am I doing wrong which could lead to a connection leak?
If I remember correctly from my experiments with SQLAlchemy, scoped_session() is used to create sessions that you can access from multiple places. That is, you create a session in one method and use it in another without explicitly passing the session object around.
It does that by keeping a list of sessions and associating them with a "scope ID". By default, to obtain a scope ID, it uses the current thread ID, so you have one session per thread. You can supply a scopefunc to provide – for example – one ID per request:
# This is (approx.) what flask-sqlalchemy does:
from flask import _request_ctx_stack as context_stack

Session = orm.scoped_session(orm.sessionmaker(engine),
                             scopefunc=context_stack.__ident_func__)
Also, take note of the other answers and comments about doing background tasks.
First of all, that is a really, really bad way to run background tasks. Try an async scheduler like Celery.
Not 100% sure, so this is a bit of a guess based on the information provided, but I wonder if each page load is starting a new db connection which then listens for notifications. If this is the case, I wonder if the db connection is effectively removed from the pool and so a new one gets created on the next page load.
If this is the case, my recommendation would be to have a separate DBI database handle dedicated to listening for notifications so that these are not active in the queue. This might be done outside your workflow.
Also
In particular, the leak happens when making more than one simultaneous request. At the same time, I could see that some of the requests were left with uncompleted query execution and timed out. You can write something to manage this yourself.
When accessing a MySQL database at a low level in Python, I use the MySQLdb module.
I create a connection instance, then a cursor instance, and then I pass the cursor to every function that needs it.
Sometimes I have many nested function calls, all needing the MySQL cursor. Would it hurt to initialise the connection as a global variable, so I can save myself a parameter for each function that needs the cursor?
I can deliver an example, if my explanation was insufficient...
I think that database cursors are scarce resources, so passing them around can limit your scalability and cause management issues (e.g. which method is responsible for closing the connection?).
I'd recommend pooling connections and keeping them open for the shortest time possible. Check out the connection, perform the database operation, map any results to objects or data structures, and close the connection. Pass the object or data structure with results around rather than passing the cursor itself. The cursor scope should be narrow.
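As a rough illustration of that narrow-scope pattern with MySQLdb (the connection parameters and the users table are placeholders, and real code would typically check the connection out of a pool instead of reconnecting each time):

import MySQLdb

def fetch_users():
    # Open late, close early; placeholder credentials.
    conn = MySQLdb.connect(host="localhost", user="app", passwd="secret", db="mydb")
    try:
        cur = conn.cursor()
        cur.execute("SELECT id, name FROM users")
        rows = cur.fetchall()
        cur.close()
        # Map rows to plain data structures before the connection goes away.
        return [{"id": r[0], "name": r[1]} for r in rows]
    finally:
        conn.close()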