My client API encapsulates connections to the server in a class ServerConnection that stores an asyncio.StreamReader/-Writer pair. (For simplicity, I will not use yield from or any other async technique in the code below because it doesn't matter for my question.)
Now, I wonder whether it's better to expose both streams as part of ServerConnection's public interface or to introduce read()/write() methods on ServerConnection that forward calls to the respective stream's methods. Example:
class ServerConnection:
    def __init__(self, reader, writer):
        self.reader = reader
        self.writer = writer
vs.
class ServerConnection:
    def __init__(self, reader, writer):
        self._reader = reader
        self._writer = writer

    def read(self, size):
        return self._reader.read(size)

    def readline(self):
        return self._reader.readline()

    def write(self, data):
        return self._writer.write(data)

    def close(self):
        return self._writer.close()

    # …further proxy methods…
Pros/cons that I'm aware of so far:
Pros: ServerConnection is definitely some kind of bidirectional stream so it'd make sense to have it inherit from a StreamRWPair. Furthermore, it's much more convenient to use server_conn.read() than server_conn.reader.read().
Cons: I would need to define a whole lot of proxy methods. (In my case, I actually use some kind of buffered reader that has additional methods like peek() and so on.) So I thought about creating a base class StreamRWPair similar to Python's io.BufferedRWPair that my ServerConnection can then inherit from. The Python implementation of the io module, however, says the following about BufferedRWPair in the comments:
XXX The usefulness of this (compared to having two separate IO
objects) is questionable.
On top of that, my ServerConnection class has several other methods and I'm afraid that, by proxying, I will clutter its attribute namespace and make its usage less obvious.
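To make the StreamRWPair idea concrete, here is a minimal sketch that keeps the forwarding methods in a base class so that ServerConnection's own methods stay separate from the proxies. It assumes only the four forwarded methods from the example above; the io.BytesIO objects stand in for the real asyncio streams, and send_command is a hypothetical method added purely for illustration.

```python
import io


class StreamRWPair:
    """Forward a small, fixed set of stream methods to a reader/writer pair."""

    def __init__(self, reader, writer):
        self._reader = reader
        self._writer = writer

    def read(self, size=-1):
        return self._reader.read(size)

    def readline(self):
        return self._reader.readline()

    def write(self, data):
        return self._writer.write(data)

    def close(self):
        return self._writer.close()


class ServerConnection(StreamRWPair):
    """Connection-specific methods live here, apart from the proxies."""

    def send_command(self, name):
        # Hypothetical protocol: one command per line, one reply per line.
        self.write(name.encode("ascii") + b"\n")
        return self.readline()


# Stand-ins for the real streams, for illustration only.
reader = io.BytesIO(b"OK\nrest")
writer = io.BytesIO()
conn = ServerConnection(reader, writer)
reply = conn.send_command("PING")
sent = writer.getvalue()
```

The proxy methods still exist, but they are written once and stay out of the class that carries the protocol logic.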
I am often writing scripts with boto3 and usually when writing functions I end up passing the boto3 client for the service(s) I need around my functions. So, for example
def main():
    ec2 = create_client
    long_function_with_lots_of_steps(ec2, ....)

def long_function_with_lots_of_steps(client):
    ....
This is not too bad, but it often feels repetitive and sometimes I will need to create a new client for a different service in the other function, for which I would like to use the original aws_session object.
Is there a way to do this more elegantly? I thought to make a class holding a boto3.session.Session() object but then you end up just passing that around.
How do you usually structure boto3 scripts?
I think you might have had some C or C++ programming experience, and you may be mixing up the two languages' constructs. In Python, function arguments are passed as references to objects, so passing a client around is cheap: you aren't copying the whole object.
This is in fact one of the better ways to pass in session info. Why is it better you may ask? Because of testing. You will need to test the thing and you don't always want to test the connections to 3rd party services. So you can do that with Mocks.
Try making a test where you are mocking out any one of those function arguments. Go ahead... I'll wait.
Easier... right?
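As a rough illustration of that point, the sketch below tests a function that receives its client as an argument, with unittest.mock.Mock standing in for the boto3 EC2 client. The function count_running_instances is hypothetical, and the response is trimmed to the few keys the function reads (pagination is ignored).

```python
from unittest.mock import Mock


def count_running_instances(ec2):
    """Count instances in the 'running' state using the client passed in."""
    response = ec2.describe_instances()
    return sum(
        1
        for reservation in response["Reservations"]
        for instance in reservation["Instances"]
        if instance["State"]["Name"] == "running"
    )


# No AWS connection is needed: the mock stands in for the boto3 client.
fake_ec2 = Mock()
fake_ec2.describe_instances.return_value = {
    "Reservations": [
        {"Instances": [{"State": {"Name": "running"}},
                       {"State": {"Name": "stopped"}}]},
        {"Instances": [{"State": {"Name": "running"}}]},
    ]
}
running = count_running_instances(fake_ec2)
```

Because the client is a parameter, the test never has to patch a module-level global or a hidden session.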
Since you are basically asking for an opinion:
I usually go with your second approach. I build a base class with the session object, and build off of that. When working with a large program where I must maintain some "global" state, I make a class to house those items, and that becomes a member of my base class.
import boto3

class ProgramState:
    def __init__(self):
        self.session = boto3.session.Session()

class Base:
    def __init__(self, state: ProgramState):
        self.state = state

class Firehose(Base):
    def __init__(self, state: ProgramState):
        Base.__init__(self, state)
        self.client = self.state.session.client("firehose")

    def do_something(self):
        pass

class S3(Base):
    def __init__(self, state: ProgramState):
        Base.__init__(self, state)
        self.client = self.state.session.client("s3")

    def do_something_else(self):
        pass

def main():
    state = ProgramState()
    firehose = Firehose(state)
    s3 = S3(state)
    firehose.do_something()
    s3.do_something_else()
Full disclosure: I dislike Python.
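For scripts that only occasionally need a second service, a lighter variant of the same idea is to hold the session once and create clients on first use. This is a sketch under assumptions: AwsClients is a hypothetical helper, and it relies only on the session object having a client(service_name) method, which boto3.session.Session does. FakeSession exists solely so the example runs without AWS credentials.

```python
class AwsClients:
    """Create each service client once, on first use, from a single session."""

    def __init__(self, session):
        self._session = session
        self._clients = {}

    def get(self, service_name):
        if service_name not in self._clients:
            self._clients[service_name] = self._session.client(service_name)
        return self._clients[service_name]


class FakeSession:
    """Stand-in for boto3.session.Session, so the sketch runs without AWS."""

    def __init__(self):
        self.created = []

    def client(self, service_name):
        self.created.append(service_name)
        return f"<client for {service_name}>"


session = FakeSession()
clients = AwsClients(session)
first = clients.get("ec2")
second = clients.get("ec2")   # reuses the cached client
clients.get("s3")             # a new service, same session
```

Functions then take one clients argument instead of one argument per service, and tests can still pass in a fake.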
I am a bit uncertain regarding thread safety and multiprocessing.
From what I can tell multiprocessing.Pool.map pickles the calling function or object but leaves members passed by references intact.
This seems like it could be beneficial since it saves memory but I haven't found any information about the thread safety in those objects.
In my case I am trying to read numpy data from disk. However, I want to be able to change the data source without changing the implementation, so I've broken out the reading part into its own classes.
I roughly have the following situation:
import numpy as np
from multiprocessing import Pool
from pathlib import Path

class NpReader():
    def read_row(self, row_index):
        pass

class NpReaderSingleFile(NpReader):
    def read_row(self, row_index):
        return np.load(self._filename_from_row(row_index))

    def _filename_from_row(self, row_index):
        # Path() needs a string, so row 7 maps to "7.npy"
        return Path(str(row_index)).with_suffix('.npy')

class NpReaderBatch(NpReader):
    def __init__(self, batch_file, mmap_mode=None):
        self.batch = np.load(batch_file, mmap_mode=mmap_mode)

    def read_row(self, row_index):
        read_index = row_index
        return self.batch[read_index]

class ProcessRow():
    def __init__(self, reader):
        self.reader = reader

    def __call__(self, row_index):
        # use the reader stored on the instance, not a global name
        return self.reader.read_row(row_index).shape

readers = [
    NpReaderSingleFile(),
    NpReaderBatch('batch.npy'),
    NpReaderBatch('batch.npy', mmap_mode='r')
]

res = []
for reader in readers:
    with Pool(12) as mp:
        res.append(mp.map(ProcessRow(reader), range(100000)))
It seems to me that there are a lot of things that could go wrong here, but unfortunately I don't have the knowledge to determine what they are or how to test for them.
Are there any obvious problems with the above approach?
Some things that occurred to me are:
np.load (it seems to work well for small single files, but can I test it to see that it is safe?)
Is NpReaderBatch safe or can read_index be modified at the same time by different processes?
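One way to probe the last point without starting any processes: Pool.map sends the callable to the workers by pickling it, so each worker operates on its own copy of ProcessRow and of the reader it holds. The sketch below imitates that with a plain pickle round trip. ListReader is a hypothetical stand-in for the NumPy-backed readers, and the round trip is only an approximation of what the pool does.

```python
import pickle


class ListReader:
    """Hypothetical stand-in for the NumPy-backed readers above."""

    def __init__(self, rows):
        self.rows = rows

    def read_row(self, row_index):
        read_index = row_index  # local variable: private to this call
        return self.rows[read_index]


class ProcessRow:
    def __init__(self, reader):
        self.reader = reader

    def __call__(self, row_index):
        return len(self.reader.read_row(row_index))


original = ProcessRow(ListReader([[1, 2, 3], [4, 5]]))

# Roughly what a worker receives: a pickled copy, not the same object.
worker_copy = pickle.loads(pickle.dumps(original))

# Mutating the copy leaves the parent's object untouched.
worker_copy.reader.rows.append([6])
lengths = (len(original.reader.rows), len(worker_copy.reader.rows))
```

In this sketch read_index is a local variable, so it exists separately in every call, and separate processes do not share Python objects in any case.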
I frequently find myself writing programs in Python that construct a large (megabytes) read-only data structure and then use that data structure to analyze a very large (hundreds of megabytes in total) list of small records. Each of the records can be analyzed in parallel, so a natural pattern is to set up the read-only data structure and assign it to a global variable, then create a multiprocessing.Pool (which implicitly copies the data structure into each worker process, via fork) and then use imap_unordered to crunch the records in parallel. The skeleton of this pattern tends to look like this:
classifier = None

def classify_row(row):
    return classifier.classify(row)

def classify(classifier_spec, data_file):
    global classifier
    try:
        classifier = Classifier(classifier_spec)
        with open(data_file, "rt") as fp, \
                multiprocessing.Pool() as pool:
            rd = csv.DictReader(fp)
            yield from pool.imap_unordered(classify_row, rd)
    finally:
        classifier = None
I'm not happy with this because of the global variable and the implicit coupling between classify and classify_row. Ideally, I would like to write
def classify(classifier_spec, data_file):
    classifier = Classifier(classifier_spec)
    with open(data_file, "rt") as fp, \
            multiprocessing.Pool() as pool:
        rd = csv.DictReader(fp)
        yield from pool.imap_unordered(classifier.classify, rd)
but this does not work, because the Classifier object usually contains objects which cannot be pickled (because they are defined by extension modules whose authors didn't care about that); I have also read that it would be really slow if it did work, because the Classifier object would get copied into the worker processes on every invocation of the bound method.
Is there a better alternative? I only care about 3.x.
This was surprisingly tricky. The key here is to preserve read-access to variables that are available at fork-time without serialization. Most solutions to sharing memory in multiprocessing end up serializing. I tried using a weakref.proxy to pass in a classifier without serialization, but that didn't work because both dill and pickle will try to follow and serialize the referent. However, a module-ref works.
This organization gets us close:
import multiprocessing as mp
import csv

def classify(classifier, data_file):
    with open(data_file, "rt") as fp, mp.Pool() as pool:
        rd = csv.DictReader(fp)
        yield from pool.imap_unordered(classifier.classify, rd)

def orchestrate(classifier_spec, data_file):
    # construct a classifier from the spec; note that we can
    # even dynamically import modules here, using config values
    # from the spec
    import classifier_module
    classifier_module.init(classifier_spec)
    return classify(classifier_module, data_file)

if __name__ == '__main__':
    list(orchestrate(None, 'data.txt'))
A few changes to note here:
we add an orchestrate function for some dependency-injection goodness; orchestrate figures out how to construct/initialize a classifier, and hands it to classify, decoupling the two
classify only needs to assume that the classifier parameter has a classify method; it doesn't care if it's an instance or a module
For this Proof of Concept, we provide a Classifier that is obviously not serializable:
# classifier_module.py
def _create_classifier(spec):
    # obviously not pickle-able because it's inside a function
    class Classifier():
        def __init__(self, spec):
            pass

        def classify(self, x):
            print(x)
            return x

    return Classifier(spec)

def init(spec):
    global __classifier
    __classifier = _create_classifier(spec)

def classify(x):
    return __classifier.classify(x)
Unfortunately, there's still a global in here, but it's now nicely encapsulated inside a module as a private variable, and the module exports a tight interface composed of the classify and init functions.
This design unlocks some possibilities:
orchestrate can import and init different classifier modules, based on what it sees in classifier_spec
one could also pass an instance of some Classifier class to classify, as long as this instance is serializable and has a classify method of the same signature
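A small check of why the module reference works: pickle serializes a module-level function by name rather than by value, so sending classifier_module.classify to a worker transmits only the names, and the worker resolves them against its own forked copy of the module, including the already-initialized __classifier. The sketch uses json.dumps as a stand-in for classifier_module.classify, since any importable module-level function behaves the same way.

```python
import json
import pickle

# A module-level function is pickled as a reference (module name plus
# function name), not as a copy of its code or of any module state.
payload = pickle.dumps(json.dumps)
restored = pickle.loads(payload)

# Unpickling looks the name up again and returns the very same object.
same_object = restored is json.dumps
mentions_names = b"json" in payload and b"dumps" in payload
```

Nothing held in the module's globals travels through the pipe, which is why the unpicklable Classifier never has to be serialized.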
If you want to use forking, I don't see a way around using a global. But I also don't see a reason why you would have to feel bad about using a global in this case, you're not manipulating a global list with multi-threading or so.
It's possible to cope with the ugliness in your example, though. You want to pass classifier.classify directly, but the Classifier object contains objects which cannot be pickled.
import os
import csv
import uuid
from threading import Lock
from multiprocessing import Pool
from weakref import WeakValueDictionary

class Classifier:
    def __init__(self, spec):
        self.lock = Lock()  # unpickleable
        self.spec = spec

    def classify(self, row):
        return f'classified by pid: {os.getpid()} with spec: {self.spec}', row
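To see the problem concretely, the sketch below uses a trimmed copy of the Classifier above and tries to pickle its bound method directly, which is roughly what the pool would have to do with classifier.classify. The trimmed class body is an assumption made to keep the example short.

```python
import pickle
from threading import Lock


class Classifier:
    def __init__(self, spec):
        self.lock = Lock()  # locks cannot be pickled
        self.spec = spec

    def classify(self, row):
        return self.spec, row


classifier = Classifier("spec1")

# Pickling a bound method pickles the instance it is bound to, so the
# lock attribute makes the whole call fail.
try:
    pickle.dumps(classifier.classify)
    error = None
except TypeError as exc:
    error = exc
```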
I suggest we subclass Classifier and define __getstate__ and __setstate__ to enable pickling. Since you're using forking anyway, the only state it has to pickle is the information needed to get a reference to a forked global instance. Then we just update the unpickled object's __dict__ with the __dict__ of the forked instance (which hasn't gone through the reduction of pickling), and your instance is complete again.
To achieve this without additional boilerplate, the subclassed Classifier instance has to generate a name for itself and register this as a global variable. This first reference will be a weak reference, so the instance can be garbage collected when the user expects it. The second reference is created by the user when they assign classifier = Classifier(classifier_spec). This one doesn't have to be global.
The name in the example below is generated with the help of the standard library's uuid module. A UUID is converted to a string and edited into a valid identifier (it wouldn't have to be, but it's convenient for debugging in interactive mode).
class SubClassifier(Classifier):
    def __init__(self, spec):
        super().__init__(spec)
        self.uuid = self._generate_uuid_string()
        self.pid = os.getpid()
        self._register_global()

    def __getstate__(self):
        """Define pickled content."""
        return {'uuid': self.uuid}

    def __setstate__(self, state):
        """Set state in child process."""
        self.__dict__ = state
        self.__dict__.update(self._get_instance().__dict__)

    def _get_instance(self):
        """Get reference to instance."""
        return globals()[self.uuid][self.uuid]

    @staticmethod
    def _generate_uuid_string():
        """Generate id as valid identifier."""
        # return 'uuid_' + '123'  # for testing
        return 'uuid_' + str(uuid.uuid4()).replace('-', '_')

    def _register_global(self):
        """Register global reference to instance."""
        weakd = WeakValueDictionary({self.uuid: self})
        globals().update({self.uuid: weakd})

    def __del__(self):
        """Clean up globals when deleted in parent."""
        if os.getpid() == self.pid:
            globals().pop(self.uuid)
The sweet thing here is, the boilerplate is totally gone. You don't have to mess manually with declaring and deleting globals since the instance manages everything itself in background:
def classify(classifier_spec, data_file, n_workers):
    classifier = SubClassifier(classifier_spec)
    # assert globals()['uuid_123']['uuid_123']  # for testing
    with open(data_file, "rt") as fh, Pool(n_workers) as pool:
        rd = csv.DictReader(fh)
        yield from pool.imap_unordered(classifier.classify, rd)

if __name__ == '__main__':
    PATHFILE = 'data.csv'
    N_WORKERS = 4
    g = classify(classifier_spec='spec1', data_file=PATHFILE, n_workers=N_WORKERS)
    for record in g:
        print(record)
    # assert 'uuid_123' not in globals()  # no reference left
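The __getstate__/__setstate__ trick can be checked in a single process, without a pool. The sketch below is a simplified variant under stated assumptions: a plain module-level dict named _REGISTRY (hypothetical) replaces the globals() entry and the WeakValueDictionary, and the class is called Registered to avoid confusion with the classes above.

```python
import pickle
from threading import Lock

_REGISTRY = {}  # hypothetical stand-in for the globals() registration


class Registered:
    def __init__(self, key, spec):
        self.key = key
        self.spec = spec
        self.lock = Lock()  # would normally block pickling
        _REGISTRY[key] = self

    def __getstate__(self):
        # Only the lookup key is serialized.
        return {"key": self.key}

    def __setstate__(self, state):
        # Rebuild the copy from the live instance found under the key.
        self.__dict__ = state
        self.__dict__.update(_REGISTRY[self.key].__dict__)


original = Registered("uuid_123", "spec1")
payload = pickle.dumps(original)
copy = pickle.loads(payload)
```

The payload carries the key but not the spec or the lock. In a forked worker, the lookup would hit the instance inherited from the parent at fork time.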
The multiprocessing.sharedctypes module provides functions for allocating ctypes objects from shared memory which can be inherited by child processes, i.e., parent and children can access the shared memory.
You could use
1. multiprocessing.sharedctypes.RawArray to allocate a ctypes array from the shared memory.
2. multiprocessing.sharedctypes.RawValue to allocate a ctypes object from the shared memory.
Dr Mianzhi Wang has written a very detailed document on this. You could share multiple multiprocessing.sharedctypes objects.
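A minimal sketch of the two calls, run in a single process for simplicity. The variable names are illustrative. Note that the Raw variants come without a lock, so concurrent writers would need their own synchronization.

```python
from multiprocessing.sharedctypes import RawArray, RawValue

# Allocate five C doubles and one C int in shared memory.
shared_row = RawArray("d", 5)
shared_count = RawValue("i", 0)

# Both behave like ordinary ctypes objects in the process that made them.
for i in range(len(shared_row)):
    shared_row[i] = i * 0.5
shared_count.value = len(shared_row)

values = list(shared_row)
```

To share them with workers, the objects would typically be created before the child processes start and passed to (or inherited by) those processes.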
You may find the solution here useful to you.
PyCharm gives me a hint this method may be static. Well, I know. But should I?
In C++ I would have defined a method static only if I had a good reason for that, mainly wanting to use the method not with any instance, or having a common data for all instances, like counters. What is the case in Python?
An answer I found says it saves memory, but I don't think the amount is such it should affect the coding style.
If a method doesn't interact with self, I'd make it a @staticmethod. Especially for public APIs, it's confusing if a method isn't static when it could be.
For example:
class File(object):
    BUFFERSIZE = 65536

    def __init__(self, path):
        self.path = path
        self.descriptor = None

    @staticmethod
    def get_buffersize():
        return File.BUFFERSIZE

    def open(self, mode):
        self.descriptor = open(self.path, mode)

    def close(self):
        self.descriptor.close()
Obviously a stupid class but you get the idea, hopefully. In fact, in a real world application I'd make get_buffersize a buffersize property of File.
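A sketch of that property variant, keeping only the parts of File that matter here. Using type(self) is one possible choice, made so that a subclass overriding BUFFERSIZE is respected.

```python
class File:
    BUFFERSIZE = 65536

    def __init__(self, path):
        self.path = path
        self.descriptor = None

    @property
    def buffersize(self):
        # Read-only view of the class-level constant.
        return type(self).BUFFERSIZE


f = File("example.txt")
size = f.buffersize
```

Without a setter, assigning to f.buffersize raises AttributeError, which fits a value that is meant to be constant.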
I am working with the Python canmatrix library (well, presently my Python3 fork) which provides a set of classes for an in-memory description of CAN network messages as well as scripts for importing and exporting to and from on-disk representations (various standard CAN description file formats).
I am writing a PyQt application using the canmatrix library and would like to add some minor additional functionality to the bottom-level Signal class. Note that a CanMatrix organizes its member Frames, which in turn organize their member Signals. The whole structure is created by an import script which reads a file. I would like to retain the import script and sub-member finder functions of each layer but add an extra 'value' member to the Signal class as well as getters/setters that can trigger Qt signals (not related to the canmatrix Signal objects).
It seems that standard inheritance approaches would require me to subclass every class in the library and override every function which creates the library Signal to use mine instead. Ditto for the import functions. This just seems horribly excessive to add non-intrusive functionality to a library.
I have tried inheriting and replacing the library class with my inherited one (with and without the pass-through constructor) but the import still creates library classes, not mine. I forget if I copied this from this other answer or not, but it's the same structure as referenced there.
class Signal(QObject, canmatrix.Signal):
    _my_signal = pyqtSignal(int)

    def __init__(self, *args, **kwargs):
        canmatrix.Signal.__init__(self, *args, **kwargs)
        # TODO: what about QObject
        print('boo')

    def connect(self, target):
        self._my_signal.connect(target)

    def set_value(self, value):
        self._my_value = value
        self._my_signal.emit(value)

canmatrix.Signal = Signal
print('overwritten')
Is there a direct error in my attempt here?
Am I doing this all wrong and need to go find some (other) design pattern?
My next attempt involved shadowing each instance of the library class. For any instance of the library class that I want to add the functionality to I must construct one of my objects which will associate itself with the library-class object. Then, with an extra layer, I can get from either object to the other.
class Signal(QObject):
    _my_signal = pyqtSignal(int)

    def __init__(self, signal):
        signal.signal = self
        self.signal = signal
        # TODO: what about QObject parameters
        QObject.__init__(self)
        self.value = None

    def connect(self, target):
        self._my_signal.connect(target)

    def set_value(self, value):
        self.value = value
        self._my_signal.emit(value)
The extra layer is annoying (library_signal.signal.set_value() rather than library_signal.set_value()) and the mutual references seem like they may keep both objects from ever getting cleaned up.
This does run and function, but I suspect there's still a better way.
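On the cleanup worry, a quick check with plain Python objects suggests that the mutual references alone are not a problem, because CPython's cyclic garbage collector reclaims unreachable reference cycles. LibrarySignal and Shadow are hypothetical stand-ins, and the sketch deliberately leaves out QObject, whose parent/child ownership on the Qt side may behave differently.

```python
import gc
import weakref


class LibrarySignal:
    """Hypothetical stand-in for the library's Signal class."""


class Shadow:
    """Hypothetical stand-in for the wrapper that shadows it."""

    def __init__(self, signal):
        signal.signal = self   # library object -> shadow
        self.signal = signal   # shadow -> library object


library_signal = LibrarySignal()
probe = weakref.ref(library_signal)
Shadow(library_signal)

# Drop the only outside reference; the two objects now only reference
# each other.
del library_signal
gc.collect()
collected = probe() is None
```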