I'm having problems sending proxy objects across a TCP connection with Python 3.7.3. They work fine locally, but the authentication keys don't get set correctly when the two processes talk to each other over TCP.
The story goes like this: I have an instance of a class on one process and I want to refer to it from another process on a different machine. So I create my BaseManager on the first process with address=('', 50000), pull out a copy of its _authkey, and tell the other process to create a BaseManager with address=('whatever', 50000) along with authkey=... and call connect(). So far, so good.
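For reference, a rough sketch of that setup (hostname and key are placeholders, and the key is passed explicitly here rather than copied from _authkey; the registrations shown below happen before starting/connecting):

from multiprocessing.managers import BaseManager

# First process: serve the registered objects on port 50000 on all interfaces.
mgr = BaseManager(address=('', 50000), authkey=b'placeholder-key')
mgr.start()  # or mgr.get_server().serve_forever()

# Second process (possibly on another machine): same port, same key.
remote_mgr = BaseManager(address=('first-host', 50000), authkey=b'placeholder-key')
remote_mgr.connect()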
The first process has a few things registered:
BaseManager.register('ManagerClass', ManagerClass)
BaseManager.register('managers', callable=lambda: managers)
managers is just a dictionary. ManagerClass has a method that saves a self-proxy in the dictionary, created like this:
import multiprocessing
import multiprocessing.managers

def autoself(self):
    # Only the manager's server process has a _manager_server attribute.
    server = getattr(multiprocessing.current_process(), '_manager_server', None)
    classname = self.__class__.__name__
    if server:
        # Find this object in the server's registry and build a proxy for it.
        for key, value in server.id_to_obj.items():
            if value[0] == self:
                token = multiprocessing.managers.Token(
                    typeid=classname, address=server.address, id=key)
                proxy = multiprocessing.managers.AutoProxy(
                    token, 'pickle', authkey=server.authkey)
                return proxy
    else:
        return self
Incidentally, if I try to store the ManagerClass object directly in the dictionary and then transfer it, I get
TypeError: can't pickle _thread.lock objects
No great surprise - it's a complicated object and probably has thread locks in there somewhere.
So, I store the self-proxy created with autoself into the dictionary and transfer it.
It almost works, but the authkey doesn't get set correctly on the receiving end. It looks like the authkey gets set to the local process's authkey, because that's the default in AutoProxy when no authkey or manager is specified.
Well, how would it be specified? The dictionary is represented by a proxy object, which calls the remote method items(), which returns a pickled RebuildProxy containing an AutoProxy. Should RebuildProxy figure out which manager it's being called from and pick up the authkey from there? What if the returned proxy object refers to an object on a different process than the one holding the dictionary? I don't know.
I've "fixed" this by hacking BaseProxy's __reduce__ method to always transfer its authkey, irregardless of get_spawning_popen(), and disabling the check in AuthenticationString that prevents it being pickled. Look in Python's multiprocessing/ directory to make sense out of what I'm talking about.
Now it works.
Not really sure what to do about it. What if we've got a complicated setup with multiple processes passing proxy objects around? No reason to assume that a process receiving a proxy object has an authkey for the process that manages the object; it might only have an authkey for the process sending the proxy. Does that mean we need to pass authkeys around with proxy objects? Or am I missing a better way to do this?
I've found a simple way to fix my problem: use the same authentication key for every process on every machine.
Just set the key first thing when I import multiprocessing:
import multiprocessing
multiprocessing.current_process().authkey = b"secret"
No need to change the standard library code.
Suppose I have a replicated master-slave Redis setup; how do I actually access it from a client library? Presumably I need a client instance per host, and I need to decide which one to use for writing and which for reading, like:
import random
import redis

master = redis.Redis(master_host, port)
clients = [redis.Redis(host, port) for host in slave_hosts]

master.set('key', 'value')
client = random.choice(clients)
client.get('other-key')
I was thinking there should be some magic in the library where I could provide a list of hosts for making such routing automatic, but couldn't find it.
I've also looked into redis-cluster and redis-sentinel and they all start by talking about automatic failovers when slaves become masters, and I'm not sure it's what I need. What I need is a consistent master which I can afford to lose for some time (I can hold up updates in a queue).
Are you intentionally splitting reads and writes, and doing so because you already know you will be overwhelming the Redis instance? If not, don't worry about splitting r/w between servers. Use Sentinel from the client as a lookup to see what node is the master and connect to it to do all of your reads and writes.
When you do eventually have to split reads off, your code will need to be written so that you establish a connection for each read slave and only send reads to it. You'll also need to detect failovers and redistribute your read/write split.
Now, if you have a separate read-only process, you can either let it query Sentinel for a slave instead of the master, or you can set up a non-promotable slave for that process to use - though if that slave goes down, you'd lose read access.
You Ain't Gonna Need It (YAGNI) is a good principle to follow here, as is avoiding premature optimization. A single Redis instance can be incredibly fast and doesn't suffer the performance drops that highly complex queries cause in traditional SQL datastores. So, absent data showing otherwise, I would recommend running with a standard setup where you simply query Sentinel for the current master and use the one it returns.
Sentinel can do that:
You can also create Redis client connections from a Sentinel instance. You can connect to either the master (for write operations) or a slave (for read-only operations).
>>> master = sentinel.master_for('mymaster', socket_timeout=0.1)
>>> slave = sentinel.slave_for('mymaster', socket_timeout=0.1)
>>> master.set('foo', 'bar')
>>> slave.get('foo')
'bar'
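For completeness, the sentinel object in that snippet is built from one or more Sentinel endpoints; a minimal sketch (host and port below are placeholders):

from redis.sentinel import Sentinel

# Point the client at your Sentinel endpoint(s).
sentinel = Sentinel([('localhost', 26379)], socket_timeout=0.1)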
Read more in the Redis Python client documentation.
I am using Redis as well. In my case I have a wrapper class like:
import redis

class RedisConnection:
    def __init__(self, master_host, slave_host):
        self.__master = redis.Redis(host=master_host)
        self.__slave = redis.Redis(host=slave_host)

    def write_xxx(self, value):
        # 'xxx' stands in for whatever key this wrapper manages.
        self.__master.set('xxx', value)

    def read_xxx(self):
        return self.__slave.get('xxx')
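Usage would then look something like this (hostnames are made up):

conn = RedisConnection('redis-master.example.com', 'redis-slave.example.com')
conn.write_xxx('some value')   # write goes to the master
conn.read_xxx()                # read is served by the slave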
Applications often need to connect to other services (a database, a cache, an API, etc.). For sanity and DRY, we'd like to keep all of these connections in one module so the rest of our code base can share connections.
To reduce boilerplate, downstream usage should be simple:
# app/do_stuff.py
from .connections import AwesomeDB

db = AwesomeDB()

def get_stuff():
    return db.get('stuff')
And setting up the connection should also be simple:
# app/cli.py or some other main entry point
from .connections import AwesomeDB
db = AwesomeDB()
db.init(username='stuff admin') # Or os.environ['DB_USER']
Web frameworks like Django and Flask do something like this, but it feels a bit clunky:
Connect to a Database in Flask, Which Approach is better?
http://flask.pocoo.org/docs/0.10/tutorial/dbcon/
One big issue with this is that we want a reference to the actual connection object instead of a proxy, because we want to retain tab-completion in IPython and other dev environments.
So what's the Right Way (tm) to do it? After a few iterations, here's my idea:
# app/connections.py
from awesome_database import AwesomeDB as RealAwesomeDB
from horrible_database import HorribleDB as RealHorribleDB

class ConnectionMixin(object):
    __connection = None

    def __new__(cls):
        cls.__connection = cls.__connection or object.__new__(cls)
        return cls.__connection

    def __init__(self, real=False, **kwargs):
        if real:
            super().__init__(**kwargs)

    def init(self, **kwargs):
        kwargs['real'] = True
        self.__init__(**kwargs)

class AwesomeDB(ConnectionMixin, RealAwesomeDB):
    pass

class HorribleDB(ConnectionMixin, RealHorribleDB):
    pass
Room for improvement: Set initial __connection to a generic ConnectionProxy instead of None, which catches all attribute access and throws an exception.
I've done quite a bit of poking around here on SO and in various OSS projects and haven't seen anything like this. It feels pretty solid, though it does mean a bunch of modules will be instantiating connection objects as a side effect at import time. Will this blow up in my face? Are there any other negative consequences to this approach?
First, design-wise, I might be missing something, but I don't see why you need the heavy mixin+singleton machinery instead of just defining a helper like so:
_awesome_db = None

def awesome_db(**overrides):
    global _awesome_db
    if _awesome_db is None:
        # Read config/set defaults.
        # overrides.setdefault(...)
        _awesome_db = RealAwesomeDB(**overrides)
    return _awesome_db
Also, there is a bug (this might not be a supported use case, but still): if you make the following two calls in a row, you wrongly get the same connection object twice even though you passed different parameters:
db = AwesomeDB()
db.init(username='stuff admin')
db = AwesomeDB()
db.init(username='not-admin') # You'll get the admin connection here.
An easy fix for that would be to use a dict of connections keyed on the input parameters.
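A rough sketch of that fix, reusing the helper above and keying on the keyword arguments (RealAwesomeDB is the question's placeholder class):

_connections = {}

def awesome_db(**overrides):
    # One connection per distinct set of parameters.
    key = tuple(sorted(overrides.items()))
    if key not in _connections:
        # Read config/set defaults here as before.
        _connections[key] = RealAwesomeDB(**overrides)
    return _connections[key]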
Now, on the essence of the question.
I think the answer depends on how your "connection" classes are actually implemented.
Potential downsides I see with your approach:
In a multithreaded environment you could get problems with unsynchronized concurrent access to the global connection object from multiple threads, unless it is already thread-safe. If you care about that, you could change your code and interface a bit and use a thread-local variable (see the sketch after this list).
What if a process forks after creating the connection? Web application servers tend to do that and it might not be safe, again depending on the underlying connection.
Does the connection object have state? What happens if the connection object becomes invalid (e.g. due to a connection error or timeout)? You might need to replace the broken connection with a new one to return the next time a connection is requested.
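For the first point, here is the thread-local variant as a minimal sketch (again using the question's RealAwesomeDB placeholder):

import threading

_local = threading.local()

def awesome_db(**overrides):
    # Each thread lazily gets its own connection object.
    if getattr(_local, 'db', None) is None:
        _local.db = RealAwesomeDB(**overrides)
    return _local.db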
Connection management is often already efficiently and safely implemented through a connection pool in client libraries.
For example, the redis-py Redis client uses the following implementation:
https://github.com/andymccurdy/redis-py/blob/1c2071762ad9b9288e786665990083e61c1cf355/redis/connection.py#L974
The Redis client then uses the connection pool like so:
Requests a connection from the connection pool.
Tries to execute a command on the connection.
If the connection fails, the client closes it.
In either case, the connection is finally returned to the connection pool so it can be reused by subsequent calls or other threads.
So since the Redis client handles all of that under the hood, you can safely do what you want directly. Connections will be lazily created until the connection pool reaches full capacity.
# app/connections.py
import redis

def redis_client(**kwargs):
    # Maybe read configuration/set default arguments
    # kwargs.setdefault()
    return redis.Redis(**kwargs)
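If you prefer the sharing to be explicit, a variant of the helper that hands every client the same redis-py ConnectionPool (parameters are placeholders):

import redis

# One pool for the whole application.
_pool = redis.ConnectionPool(host='localhost', port=6379, db=0)

def redis_client():
    # Each call returns a lightweight client object backed by the shared pool.
    return redis.Redis(connection_pool=_pool)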
SQLAlchemy supports connection pooling as well.
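For example, a pooled SQLAlchemy engine can be created once and shared in much the same way (the connection URL and pool sizes are placeholders):

from sqlalchemy import create_engine

# The engine manages its own connection pool internally.
engine = create_engine(
    'postgresql://user:password@localhost/mydb',
    pool_size=5,
    max_overflow=10,
)

def get_connection():
    # Checks a connection out of the pool; it goes back on close().
    return engine.connect()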
To summarize, my understanding is that:
If your client library supports connection pooling, you don't need to do anything special to share connections between modules and even threads. You could just define a helper similar to redis_client() that reads configuration, or specifies default parameters.
If your client library provides only low-level connection objects, you will need to make sure access to them is thread-safe and fork-safe. Also, you need to make sure you return a valid connection each time (or raise an exception if you can't establish a new one or reuse an existing one).
I have a Model class which is part of my self-crafted ORM. It has all kind of methods like save(), create() and so on. Now, the thing is that all these methods require a connection object to act properly. And I have no clue on what's the best approach to feed a Model object with a connection object.
What I thought of so far:
provide a connection object in Model's __init__(); this will work, by setting an instance variable and using it throughout the methods, but it kind of breaks the API; users shouldn't have to pass a connection object every time they create a Model object;
create the connection object separately, store it somewhere (where?), and in Model's __init__() fetch the connection from wherever it has been stored and put it in an instance variable (this is what I thought to be the best approach, but I have no idea of the best spot to store that connection object);
create a connection pool that gets fed with connection objects, then in Model's __init__() fetch a connection from the pool (how do I know which connection to fetch from the pool?).
If there are any other approaches, please do tell. Also, I would like to know the proper way to do this.
Here's how I would do it:
Use a connection pool with a queue interface. You don't have to choose a connection object, you just pick the next one in line. This can be done whenever you need a transaction, and the connection is put back afterwards.
Unless you have some very specific needs, I would use a Singleton class for the database connection. No need to pass parameters to the constructor every time.
For testing, you just put a mocked database connection on the Singleton class.
Edit:
About the connection pool questions (I could be wrong here, but it would be my first try):
Keep all connections open. Pop one when you need it, put it back when you don't need it anymore, just like a regular queue. This queue could be exposed from the Singleton.
You start with a fixed, default number of connections (say 20). You could override the pop method so that when the queue is empty you either block (wait for another connection to be freed, if the program is multi-threaded) or create a new connection on the fly. A minimal sketch follows below.
Destroying connections is more subtle. You need to keep track of how many connections the program is using and how likely it is that you have too many. Take care, because destroying a connection that will be needed later slows the program down. In the end, it's a heuristic problem that changes the performance characteristics.
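A bare-bones sketch of the queue-backed pool described above (the connection factory is a placeholder):

import queue

class ConnectionPool:
    def __init__(self, connect, size=20):
        self._connect = connect          # e.g. a function returning a new DB connection
        self._queue = queue.Queue()
        for _ in range(size):
            self._queue.put(connect())

    def pop(self, timeout=None):
        # Blocks until a connection is free; could instead create one on the fly.
        return self._queue.get(timeout=timeout)

    def put(self, conn):
        self._queue.put(conn)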
I am trying to understand a simple Python proxy example using Twisted, located here. The proxy instantiates a Server class, which in turn instantiates a client class. defer.DeferredQueue() is used to pass data from the client class to the server class.
I am now trying to understand how defer.DeferredQueue() works in this example. For example, what is the significance of this statement:
self.srv_queue.get().addCallback(self.clientDataReceived)
and its analogous statement:
self.cli_queue.get().addCallback(self.serverDataReceived)
What happens when self.cli_queue.put(False) or self.cli_queue = None is executed?
I'm just trying to get to grips with Twisted now, so things seem pretty daunting. A small explanation of how things are connected would make it far easier to come to grips with this.
According to the documentation, DeferredQueue has a normal put method to add objects to the queue and a deferred get method.
The get method returns a Deferred object. You add a callback method (e.g. serverDataReceived) to the Deferred. Whenever an object becomes available in the queue, the Deferred fires and invokes the callback, passing the object as an argument. If the queue is empty, or the serverDataReceived method hasn't finished executing, your program simply continues executing the next statements. When a new object becomes available in the queue, the callback is called regardless of where your program happens to be executing.
In other words, this is an asynchronous flow, in contrast to a synchronous model where you might have a BlockingQueue, i.e., your program would wait until the next object is available in the queue before continuing.
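A tiny standalone illustration of that pattern (the handler name is made up):

from twisted.internet import defer

queue = defer.DeferredQueue()

def on_data(chunk):
    # Fires asynchronously once something is put into the queue.
    print("got:", chunk)

queue.get().addCallback(on_data)   # register interest in the *next* item
queue.put(b"hello")                # triggers on_data(b"hello")

Note that each get() only delivers a single item, which is why code like the proxy example typically calls get() again from inside the callback to keep receiving.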
In your example program, self.cli_queue.put(False) adds a False object to the queue. It is a sort of flag telling the ProxyClient thread that there won't be any more data added to the queue, so it should disconnect the remote connection. You can refer to this portion of code:
def serverDataReceived(self, chunk):
    if chunk is False:
        self.cli_queue = None
        log.msg("Client: disconnecting from peer")
        self.factory.continueTrying = False
        self.transport.loseConnection()
Setting cli_queue = None just discards the queue after the connection is closed.
I am basically familiar with the RPC solutions available in Python: XML-RPC and Pyro. I can make a remote object by binding it on the server side, and then I get a proxy object on the client side that I can operate on. When I call a method on the remote object, e.g. proxy.get_file(), the RPC mechanism tries to serialize the resulting object (a file in this case). This is usually the expected behavior, but what I need is to get the file object as another remote proxy object instead of having it transferred to the client side:
afile_proxy = proxy.get_file()
Instead of:
afile = proxy.get_file()
I could rebind this object on the server side and handle such a case on the client side, but that would require some boilerplate code. Is there a mechanism/library that would do this for me? It could, for example, keep objects remote until they are primitive ones.
I have found a library that does exactly what I need: RPyC. From the intro:
simple, immutable python objects (like strings, integers, tuples, etc.) are passed by value, meaning the value itself is passed to the other side.
all other objects are passed by reference, meaning a "reference" to the object is passed to the other side. This allows changes applied on the referenced object to be reflected on the actual object.
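For illustration, a minimal sketch of how that pass-by-reference looks with RPyC (the service and method names here are invented):

import rpyc

class FileService(rpyc.Service):
    def exposed_get_file(self, path):
        # The file object stays on the server; the client gets a reference (netref) to it.
        return open(path, 'rb')

# Server side (run separately):
#   from rpyc.utils.server import ThreadedServer
#   ThreadedServer(FileService, port=18861).start()

# Client side:
conn = rpyc.connect('server-host', 18861)
afile_proxy = conn.root.get_file('/tmp/example.bin')  # a remote reference
data = afile_proxy.read(100)                          # read() executes on the server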
Anyway, thanks for pointing out a 'reference' term. :)
I am involved in developing a new remote-object interaction framework, Versile Python (VPy), which performs similar functions to the ones you have listed. VPy is in development, with the current releases primarily intended for testing, but feel free to take a look.
There are two ways you could perform the type of remote I/O you are describing with VPy. One is to use remote native object references to e.g. file objects, similar to Pyro/RPyC, and access those objects as if they were local.
Another option is to use the VPy remote stream framework, which is quite flexible and can be configured to perform bi-directional streaming and operations such as remotely repositioning or truncating the stream. The second approach has the advantage that it enables asynchronous I/O, and the stream framework splits up data transmission to reduce the effects of round-trip latency.
afile_proxy = proxy.get_file_proxy()
And define in the API what a FileProxy object is. It all depends on what the client needs to do with the proxy. Get the name? Delete it? Get its contents?
You could even get away with a reference (a URL, maybe) if all you want is to keep track of something you want to process later. It's what's done on the web with all embedded content, like images.
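For instance, with plain XML-RPC you could hand back an opaque handle and wrap it on the client; everything below (method names, handles) is invented purely for illustration:

class FileProxy:
    """Client-side wrapper around a server-side file handle."""

    def __init__(self, rpc, handle):
        self._rpc = rpc          # e.g. an xmlrpc.client.ServerProxy
        self._handle = handle    # opaque id the server uses to find the file

    def name(self):
        return self._rpc.file_name(self._handle)

    def read(self, size):
        return self._rpc.file_read(self._handle, size)

    def delete(self):
        return self._rpc.file_delete(self._handle)

afile_proxy = FileProxy(proxy, proxy.get_file_handle())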