I have a Model class which is part of my self-crafted ORM. It has all kinds of methods like save(), create() and so on. The thing is that all these methods require a connection object to work properly, and I have no clue what the best approach is for feeding a Model object with a connection object.
What I have thought of so far:
provide a connection object to Model's __init__(); this will work, by setting an instance variable and using it throughout the methods, but it kind of breaks the API; users shouldn't have to feed a connection object every time they create a Model object;
create the connection object separately, store it somewhere (where?), and in Model's __init__() fetch the connection from wherever it has been stored and put it in an instance variable (this is what I think is the best approach, but I have no idea of the best spot to store that connection object);
create a connection pool which will be fed with the connection object, then on Model's __init__() fetch the connection from the connection pool (how do I know which connection to fetch from the pool?).
If there are any other approaches, please do tell. Also, I would like to know which is the proper way to do this.
Here's how I would do it:
Use a connection pool with a queue interface. You don't have to choose a specific connection object; you just take the next one in line. You can do this whenever you need a transaction, and put the connection back afterwards.
Unless you have some very specific needs, I would use a Singleton class for the database connection. No need to pass parameters on the constructor every time.
For testing, you just put a mocked database connection on the Singleton class.
Edit:
About the connection pool questions (I could be wrong here, but it would be my first try):
Keep all connections open. Pop one when you need it, put it back when you don't need it anymore, just like a regular queue. This queue could be exposed from the Singleton.
You start with a fixed, default number of connections (like 20). You could override the pop method, so that when the queue is empty you block (wait for another connection to be freed if the program is multi-threaded) or create a new connection on the fly.
Destroying connections is more subtle. You need to keep track of how many connections the program is using and how likely it is that you have too many. Take care, because destroying a connection that will be needed again later slows the program down. In the end, it's a heuristic problem that shapes the program's performance characteristics.
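Here is a minimal sketch of the kind of pool I mean, using sqlite3 as a stand-in driver (the pool size, the dsn and the acquire()/release() names are just placeholders):

import queue
import sqlite3  # stand-in driver; any DB-API module with a connect() would do

class ConnectionPool:
    """Singleton that hands out open connections from a queue."""
    _instance = None

    def __new__(cls, *args, **kwargs):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __init__(self, size=20, dsn="app.db"):
        if getattr(self, "_initialized", False):
            return  # the pool was already filled by an earlier call
        self._initialized = True
        self._pool = queue.Queue()
        for _ in range(size):
            self._pool.put(sqlite3.connect(dsn, check_same_thread=False))

    def acquire(self, timeout=None):
        # Blocks until a connection is free; raises queue.Empty on timeout.
        return self._pool.get(timeout=timeout)

    def release(self, conn):
        self._pool.put(conn)

A Model method would then call conn = ConnectionPool().acquire(), run its queries in a transaction, and release the connection in a finally block; for testing you can swap in a mocked pool by replacing ConnectionPool._instance.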
Related
I'm having problems sending proxy objects across a TCP connection with Python 3.7.3. They work fine locally, but the authentication keys don't get set right when the two processes connect over TCP.
The story goes like this: I have an instance of a class on one process and I want to refer to it from another process on a different machine. So I create my BaseManager on the first process with address=('', 50000), pull out a copy of its _authkey, and tell the other process to create a BaseManager with address=('whatever', 50000) along with authkey=... and call connect(). So far, so good.
The first process has a few things registered:
BaseManager.register('ManagerClass', ManagerClass)
BaseManager.register('managers', callable = lambda: managers)
managers is just a dictionary. ManagerClass has a method that saves a self-proxy in the dictionary, created like this:
def autoself(self):
    server = getattr(multiprocessing.current_process(), '_manager_server', None)
    classname = self.__class__.__name__
    if server:
        for key, value in server.id_to_obj.items():
            if value[0] == self:
                token = multiprocessing.managers.Token(typeid=classname, address=server.address, id=key)
                proxy = multiprocessing.managers.AutoProxy(token, 'pickle', authkey=server.authkey)
                return proxy
    else:
        return self
Incidentally, if I try to store the ManagerClass object directly in the dictionary and then transfer it, I get
TypeError: can't pickle _thread.lock objects
No great surprise - it's a complicated object and probably has thread locks in there somewhere.
So, I store the self-proxy created with autoself into the dictionary and transfer it.
It almost works, but the authkey doesn't get set right on the receiving end. It looks like the authkey gets set to the local process's authkey, because that's the default in AutoProxy if no authkey or manager is specified.
Well, how would it be specified? The dictionary is represented by a proxy object, which calls the remote method items(), which returns a pickled RebuildProxy containing an AutoProxy. Should RebuildProxy figure out what manager it's being called from and pick out the authkey from there? What if the returned proxy object refers to an object on a different process than the one holding the dictionary? Don't know.
I've "fixed" this by hacking BaseProxy's __reduce__ method to always transfer its authkey, regardless of get_spawning_popen(), and disabling the check in AuthenticationString that prevents it from being pickled. Look in Python's multiprocessing/ directory to make sense of what I'm talking about.
Now it works.
Not really sure what to do about it. What if we've got a complicated setup with multiple processes passing proxy objects around? No reason to assume that a process receiving a proxy object has an authkey for the process that manages the object; it might only have an authkey for the process sending the proxy. Does that mean we need to pass authkeys around with proxy objects? Or am I missing a better way to do this?
I've found a simple way to fix my problem: use the same authentication key for every process on every machine.
Just set the key first thing after importing multiprocessing:
import multiprocessing
multiprocessing.current_process().authkey = b"secret"
No need to change the standard library code.
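For completeness, this is roughly how both sides end up looking with the shared key (the host name, port and the b"secret" value are placeholders, and only the managers dict from above is registered):

# First process (serves the objects)
import multiprocessing
from multiprocessing.managers import BaseManager

multiprocessing.current_process().authkey = b"secret"

managers = {}
BaseManager.register('managers', callable=lambda: managers)
manager = BaseManager(address=('', 50000), authkey=b"secret")
manager.get_server().serve_forever()

# Second process (connects and receives proxies)
import multiprocessing
from multiprocessing.managers import BaseManager

multiprocessing.current_process().authkey = b"secret"

BaseManager.register('managers')
manager = BaseManager(address=('server-host', 50000), authkey=b"secret")
manager.connect()
remote_managers = manager.managers()  # proxy for the shared dict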
I've got a simple webservice which uses SQLAlchemy to connect to a database using the pattern
engine = create_engine(database_uri)
connection = engine.connect()
In each endpoint of the service, I then use the same connection, in the following fashion:
for result in connection.execute(query):
<do something fancy>
Since Sessions are not thread-safe, I'm afraid that connections aren't either.
Can I safely keep doing this? If not, what's the easiest way to fix it?
Minor note -- I don't know if the service will ever run multithreaded, but I'd rather be sure that I don't get into trouble when it does.
Short answer: you should be fine.
There is a difference between a connection and a Session. The short description is that connections represent just that… a connection to a database. Information you pass into it will come out pretty plain. It won't keep track of your transactions unless you tell it to, and it won't care about what order you send it data. So if it matters that you create your Widget object before you create your Sprocket object, then you'd better do that in a thread-safe context. The same generally goes for keeping track of a database transaction.
Session, on the other hand, keeps track of data and transactions for you. If you check out the source code, you'll notice quite a bit of back and forth over database transactions, and without a way to know that you have everything you want in a transaction, you could very well end up committing in one thread while you expect to be able to add another object (or several) in another.
In case you don't know what a transaction is, Wikipedia has a good overview, but the short version is that transactions help make sure your data stays stable. If you have 15 inserts and updates, and insert 15 fails, you might not want to apply the other 14. A transaction lets you cancel the entire operation in bulk.
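To make the transaction point concrete, here is a hedged sketch using SQLAlchemy's connection-level transaction (the widgets list and table are made up; database_uri is the one from the question):

from sqlalchemy import create_engine, text

engine = create_engine(database_uri)

# engine.begin() yields a connection wrapped in a transaction: it commits
# when the block finishes and rolls back if any statement raises.
with engine.begin() as connection:
    for name in widgets:  # hypothetical batch of 15 rows
        connection.execute(
            text("INSERT INTO widgets (name) VALUES (:name)"),  # example table
            {"name": name},
        )
# If insert 15 blows up, inserts 1 through 14 are rolled back with it.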
Applications often need to connect to other services (a database, a cache, an API, etc). For sanity and DRY, we'd like to keep all of these connections in one module so the rest of our code base can share connections.
To reduce boilerplate, downstream usage should be simple:
# app/do_stuff.py
from .connections import AwesomeDB
db = AwesomeDB()
def get_stuff():
    return db.get('stuff')
And setting up the connection should also be simple:
# app/cli.py or some other main entry point
from .connections import AwesomeDB
db = AwesomeDB()
db.init(username='stuff admin') # Or os.environ['DB_USER']
Web frameworks like Django and Flask do something like this, but it feels a bit clunky:
Connect to a Database in Flask, Which Approach is better?
http://flask.pocoo.org/docs/0.10/tutorial/dbcon/
One big issue with this is that we want a reference to the actual connection object instead of a proxy, because we want to retain tab-completion in IPython and other dev environments.
So what's the Right Way (tm) to do it? After a few iterations, here's my idea:
#app/connections.py
from awesome_database import AwesomeDB as RealAwesomeDB
from horrible_database import HorribleDB as RealHorribleDB
class ConnectionMixin(object):
    __connection = None

    def __new__(cls):
        cls.__connection = cls.__connection or object.__new__(cls)
        return cls.__connection

    def __init__(self, real=False, **kwargs):
        if real:
            super().__init__(**kwargs)

    def init(self, **kwargs):
        kwargs['real'] = True
        self.__init__(**kwargs)


class AwesomeDB(ConnectionMixin, RealAwesomeDB):
    pass


class HorribleDB(ConnectionMixin, RealHorribleDB):
    pass
Room for improvement: Set initial __connection to a generic ConnectionProxy instead of None, which catches all attribute access and throws an exception.
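For illustration, the placeholder itself could be as simple as this (the name and error message are just an idea):

class ConnectionProxy(object):
    """Stand-in that fails loudly until init() has been called."""

    def __getattr__(self, name):
        raise RuntimeError(
            "Connection has not been initialized yet; call init() first.")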
I've done quite a bit of poking around here on SO and in various OSS projects and haven't seen anything like this. It feels pretty solid, though it does mean a bunch of modules will be instantiating connection objects as a side effect at import time. Will this blow up in my face? Are there any other negative consequences to this approach?
First, design-wise, I might be missing something, but I don't see why you need the heavy mixin+singleton machinery instead of just defining a helper like so:
_awesome_db = None

def awesome_db(**overrides):
    global _awesome_db
    if _awesome_db is None:
        # Read config/set defaults.
        # overrides.setdefault(...)
        _awesome_db = RealAwesomeDB(**overrides)
    return _awesome_db
Also, there is a bug (this might not look like a supported use case, but anyway): if you make the following two calls in a row, you wrongly get the same connection object twice even though you passed different parameters:
db = AwesomeDB()
db.init(username='stuff admin')
db = AwesomeDB()
db.init(username='not-admin') # You'll get admin connection here.
An easy fix for that would be to use a dict of connections keyed on the input parameters.
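A sketch of that fix, assuming the override values are hashable:

_connections = {}

def awesome_db(**overrides):
    # Key the cache on the parameters so that different settings
    # yield different connection objects.
    key = tuple(sorted(overrides.items()))
    if key not in _connections:
        _connections[key] = RealAwesomeDB(**overrides)
    return _connections[key]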
Now, on the essence of the question.
I think the answer depends on how your "connection" classes are actually implemented.
Potential downsides with your approach I see are:
In a multithreaded environment you could get problems with unsynchronized concurrent access to the global connection object from multiple threads, unless it is already thread-safe. If you care about that, you could change your code and interface a bit and use a thread-local variable (see the sketch after this list).
What if a process forks after creating the connection? Web application servers tend to do that and it might not be safe, again depending on the underlying connection.
Does the connection object have state? What happens if the connection object becomes invalid (due to, e.g., a connection error or timeout)? You might need to replace the broken connection with a new one to return the next time a connection is requested.
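Here is what the thread-local variant of the helper above might look like (same caveats as before):

import threading

_local = threading.local()

def awesome_db(**overrides):
    # Each thread lazily creates and then reuses its own connection object.
    if not hasattr(_local, "connection"):
        # overrides.setdefault(...) as in the plain helper
        _local.connection = RealAwesomeDB(**overrides)
    return _local.connection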
Connection management is often already efficiently and safely implemented through a connection pool in client libraries.
For example, the redis-py Redis client uses the following implementation:
https://github.com/andymccurdy/redis-py/blob/1c2071762ad9b9288e786665990083e61c1cf355/redis/connection.py#L974
The Redis client then uses the connection pool like so:
Requests a connection from the connection pool.
Tries to execute a command on the connection.
If the connection fails, the client closes it.
In any case, the connection is finally returned to the connection pool so it can be reused by subsequent calls or by other threads.
So since the Redis client handles all of that under the hood, you can safely do what you want directly. Connections will be lazily created until the connection pool reaches full capacity.
# app/connections.py
import redis

def redis_client(**kwargs):
    # Maybe read configuration/set default arguments
    # kwargs.setdefault()
    return redis.Redis(**kwargs)
Similarly, SQLAlchemy supports connection pooling.
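For instance (the connection URL and pool numbers below are just illustrative):

from sqlalchemy import create_engine

# The Engine keeps an internal connection pool; engine.connect() checks a
# connection out and returns it to the pool when the connection is closed.
engine = create_engine(
    "postgresql://user:password@localhost/mydb",
    pool_size=5,          # connections kept open in the pool
    max_overflow=10,      # extra connections allowed under load
    pool_pre_ping=True,   # verify connections are alive before reuse
)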
To summarize, my understanding is that:
If your client library supports connection pooling, you don't need to do anything special to share connections between modules and even threads. You could just define a helper similar to redis_client() that reads configuration, or specifies default parameters.
If your client library provides only low-level connection objects, you will need to make sure access to them is thread-safe and fork-safe. You also need to make sure you return a valid connection each time (or raise an exception if you can't establish a new one or reuse an existing one).
I have tens (potentially hundreds) of thousands of persistent objects that I want to generate in a multithreaded fashion due to the processing required.
While the creation of the objects happens in separate threads (using the Flask-SQLAlchemy extension, btw, with scoped sessions), the call to write the generated objects to the DB happens in one place after the generation has completed.
The problem, I believe, is that the objects being created are part of several existing relationships, thereby triggering automatic addition to the identity map, despite being created in separate, concurrent threads with no explicit session in any of the threads.
I was hoping to contain the generated objects in a single list, and then write the whole list (using a single session object) to the database. This results in an error like this:
AssertionError: A conflicting state is already present in the identity map for key (<class 'app.ModelObject'>, (1L,))
That's why I believe the identity map has already been populated: the assertion error is triggered when I try to add and commit using the global session outside of the concurrent code.
The final detail is that I cannot find a way to get a reference to whatever session object(s) are involved (scoped or otherwise; I don't fully understand how automatic addition to the identity map works in the multithreaded case), so even if I wanted to deal with a separate session per process, I couldn't.
Any advice is greatly appreciated. The only reason I am not posting code (yet) is because it's difficult to abstract a working example immediately out of my app. I will post if somebody really needs to see it though.
Each session is thread-local; in other words there is a separate session for each thread. If you decide to pass some instances to another thread, they will become "detached" from the session. Use db.session.add_all(objects) in the receiving thread to put them all back.
For some reason, it looks like you're creating objects with the same identity (primary key columns) in different threads, then trying to send them both to the database. One option is to fix why this is happening, so that identities will be guaranteed unique. You may also try merging; merged_object = db.session.merge(other_object, load=False).
Edit: zzzeek's comment clued me in on something else that may be going on:
With Flask-SQLAlchemy, the session is tied to the app context. Since that is thread local, spawning a new thread will invalidate the context; there will be no database session in the threads. All the instances are detached there, and cannot properly track relationships. One solution is to pass app to each thread and perform everything within a with app.app_context(): block. Inside the block, first use db.session.add to populate the local session with the passed instances. You should still merge in the master task afterwards to ensure consistency.
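A rough sketch of that pattern (the worker function and how you split the work between threads are hypothetical):

def worker(app, instances):
    # 'app' is the Flask application object passed in from the parent thread;
    # 'instances' are model objects that became detached when crossing threads.
    with app.app_context():
        db.session.add_all(instances)  # reattach them to this thread's session
        # ... do the generation work here, then hand the results back to the
        # master task, which merges them into its own session and commits.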
I just want to clarify the problem and the solution with some pseudo-code in case somebody has this problem / wants to do this in the future.
class ObjA(object):
    obj_c = relationship('ObjC', backref='obj_a')

class ObjB(object):
    obj_c = relationship('ObjC', backref='obj_b')

class ObjC(object):
    obj_a_id = Column(Integer, ForeignKey('obj_a.id'))
    obj_b_id = Column(Integer, ForeignKey('obj_b.id'))

    def __init__(self, obj_a, obj_b):
        self.obj_a = obj_a
        self.obj_b = obj_b


def make_a_bunch_of_c(obj_a, list_of_b=None):
    return [ObjC(obj_a, obj_b) for obj_b in list_of_b]


def parallel_generate():
    list_of_a = session.query(ObjA).all()  # assume there are 1000 of these
    list_of_b = session.query(ObjB).all()  # and 30 of these
    fxn = functools.partial(make_a_bunch_of_c, list_of_b=list_of_b)
    pool = multiprocessing.Pool(10)
    all_the_things = pool.map(fxn, list_of_a)
    return all_the_things
Now let's stop here a second. The original problem was that attempting to ADD the list of ObjC's caused the error message in the original question:
session.add_all(all_the_things)
AssertionError: A conflicting state is already present in the identity map for key [...]
Note: the error occurs during the adding phase; the commit never even happens because the assertion fails pre-commit, as far as I could tell.
Solution:
all_the_things = parallel_generate()
for thing in all_the_things:
    session.merge(thing)
session.commit()
The details of session utilization when dealing with automatically added objects (via relationship cascading) are still beyond me, and I cannot explain why the conflict originally occurred. All I know is that using merge causes SQLAlchemy to sort all of the child objects that were created across 10 different processes into a single session in the master process.
I would be curious about the why, if anyone happens across this.
When accessing a MySQL database on low level using python, I use the MySQLdb module.
I create a connection instance, then a cursor instance, and then I pass the cursor to every function that needs it.
Sometimes I have many nested function calls, all desiring the mysql_cursor. Would it hurt to initialise the connection as a global variable, so I can save myself a parameter for each function that needs the cursor?
I can deliver an example, if my explanation was insufficient...
I think that database cursors are scarce resources, so passing them around can limit your scalability and cause management issues (e.g., which method is responsible for closing the connection?).
I'd recommend pooling connections and keeping them open for the shortest time possible. Check out the connection, perform the database operation, map any results to objects or data structures, and close the connection. Pass the object or data structure with results around rather than passing the cursor itself. The cursor scope should be narrow.
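As a sketch of what I mean (the users table and db_config dict are made up):

import MySQLdb

def fetch_user_names(db_config):
    # Open late, close early: neither the connection nor the cursor
    # ever leaves this function.
    conn = MySQLdb.connect(**db_config)
    try:
        cursor = conn.cursor()
        try:
            cursor.execute("SELECT id, name FROM users")
            return {row[0]: row[1] for row in cursor.fetchall()}
        finally:
            cursor.close()
    finally:
        conn.close()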