What is the benefit of using a context manager with multiprocessing.Manager? - python

In the documentation, Manager is used with a context manager (i.e. with) like so:
from multiprocessing.managers import BaseManager

class MathsClass:
    def add(self, x, y):
        return x + y
    def mul(self, x, y):
        return x * y

class MyManager(BaseManager):
    pass

MyManager.register('Maths', MathsClass)

if __name__ == '__main__':
    with MyManager() as manager:
        maths = manager.Maths()
        print(maths.add(4, 3))    # prints 7
        print(maths.mul(7, 8))    # prints 56
But what is the benefit of this, with the exception of the namespace? For opening file streams, the benefit is quite obvious in that you don't have to manually .close() the connection, but what is it for Manager? If you don't use it in a context, what steps do you have to use to ensure that everything is closed properly?
In short, what is the benefit of using the above over something like:
manager = MyManager()
maths = manager.Maths()
print(maths.add(4, 3)) # prints 7
print(maths.mul(7, 8)) # prints 56

But what is the benefit of this (...)?
First, you get the primary benefit of almost any context manager: a well-defined lifetime for the resource. It is allocated and acquired when the with ...: block is entered. It is released when the block ends (either by reaching the end or because an exception is raised). The object is still deallocated whenever the garbage collector gets around to it, but this is of less concern since the external resource has already been released.
In the case of multiprocessing.Manager (which is a function that returns a SyncManager, even though Manager looks a lot like a class), the resource is a "server" process that holds state and a number of worker processes that share that state.
what is [the benefit of using a context manager] for Manager?
If you don't use a context manager and you don't call shutdown on the manager then the "server" process will continue running until the SyncManager's __del__ is run. In some cases, this might happen soon after the code that created the SyncManager is done (for example, if it is created inside a short function and the function returns normally and you're using CPython then the reference counting system will probably quickly notice the object is dead and call its __del__). In other cases, it might take longer (if an exception is raised and holds on to a reference to the manager then it will be kept alive until that exception is dealt with). In some bad cases, it might never happen at all (if SyncManager ends up in a reference cycle then its __del__ will prevent the cycle collector from collecting it at all; or your process might crash before __del__ is called). In all these cases, you're giving up control of when the extra Python processes created by SyncManager are cleaned up. These processes may represent non-trivial resource usage on your system. In really bad cases, if you create SyncManager in a loop, you may end up creating many of these that live at the same time and could easily consume huge quantities of resources.
If you don't use it in a context, what steps do you have to use to ensure that everything is closed properly?
You have to implement the context manager protocol yourself, as you would for any context manager you used without with. It's tricky to do in pure-Python while still being correct. Something like:
import sys

manager = None
try:
    manager = MyManager()
    manager.__enter__()
    # use it ...
except:
    if manager is not None:
        # hand the exception details to __exit__ so it can clean up
        manager.__exit__(*sys.exc_info())
    raise
else:
    if manager is not None:
        manager.__exit__(None, None, None)
The public start and shutdown methods are what __enter__ and __exit__ call, respectively, so the non-with version can also be written with them.
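For completeness, here is a minimal sketch of that explicit form, equivalent to the with block from the question:
manager = MyManager()
manager.start()                    # launches the manager's server process
try:
    maths = manager.Maths()
    print(maths.add(4, 3))         # prints 7
    print(maths.mul(7, 8))         # prints 56
finally:
    manager.shutdown()             # always stop the server process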

Related

Evaluate and assign expression in or before with statement

If I am correct, the with statement doesn't introduce a local scope of its own.
These are examples from Learning Python:
with open(r'C:\misc\data') as myfile:
    for line in myfile:
        print(line)
        ...more code here...
and
lock = threading.Lock()    # After: import threading
with lock:
    # critical section of code
    ...access shared resources...
Is the second example equivalent to the following rewritten in a way similar to the first example?
with threading.Lock() as lock:
    # critical section of code
    ...access shared resources...
What are their differences?
Is the first example equivalent to the following rewritten in a way similar to the second example?
myfile = open(r'C:\misc\data')
with myfile:
    for line in myfile:
        print(line)
        ...more code here...
What are their differences?
When with enters a context, it calls a hook on the context manager object, called __enter__, and the return value of that hook can optionally be assigned to a name using as <name>. Many context managers return self from their __enter__ hook. If they do, then you can indeed take your pick between creating the context manager on a separate line or capturing the object with as.
Out of your two examples, only the file object returned from open() has an __enter__ hook that returns self. For threading.Lock(), __enter__ returns the same value as Lock.acquire(), so a boolean, not the lock object itself.
You'll need to look for explicit documentation that confirms what __enter__ returns; this is not always spelled out clearly, however. For Lock objects, the relevant section of the documentation states:
All of the objects provided by this module that have acquire() and release() methods can be used as context managers for a with statement. The acquire() method will be called when the block is entered, and release() will be called when the block is exited.
and for file objects, the IOBase documentation is rather on the vague side and you have to infer from the example that the file object is returned.
The main thing to take away is that returning self is not mandatory, nor is it always desired. Context managers are entirely free to return something else. For example, many database connection objects are context managers that let you manage the transaction (roll back or commit automatically, depending on whether or not there was an exception), where entering returns a new cursor object bound to the connection.
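A hypothetical sketch of such a context manager (not any particular database library's API), just to make it concrete that __enter__ is free to hand back a different object:
class HypotheticalCursor:
    def execute(self, query):
        print("executing", query)

class HypotheticalConnection:
    def __enter__(self):
        return HypotheticalCursor()   # "as" binds the cursor, not the connection
    def __exit__(self, exc_type, exc_value, traceback):
        if exc_type is None:
            print("commit")
        else:
            print("rollback")
        return False                  # don't suppress any exception

with HypotheticalConnection() as cur:
    cur.execute("SELECT 1")           # cur is a HypotheticalCursor here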
To be explicit:
for your open() example, the two examples are for all intents and purposes exactly the same. Both call open(), and if that does not raise an exception, you end up with a reference to that file object named myfile. In both cases the file object will be closed after the with statement is done. The name continues to exist after the with statement is done.
There is a difference, but it is mostly technical. For with open(...) as myfile:, the file object is created, has its __enter__ method called, and then myfile is bound. For the myfile = open(...) case, myfile is bound first and __enter__ is called later.
For your with threading.Lock() as lock: example, using as lock will set lock to True (acquiring the lock this way always either succeeds or blocks indefinitely). This differs from the lock = threading.Lock() case, where lock is bound to the lock object.
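A quick sketch to see that difference for yourself:
import threading

lock = threading.Lock()
with lock as value:
    print(value)    # True -- the return value of acquire(), not the lock
print(lock)         # the Lock object itself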
Here's a good explanation. I'll paraphrase the key part:
The with statement could be thought of like this code:
set things up
try:
    do something
finally:
    tear things down
Here, “set things up” could be opening a file, or acquiring some sort of external resource, and “tear things down” would then be closing the file, or releasing or removing the resource. The try-finally construct guarantees that the “tear things down” part is always executed, even if the code that does the work doesn’t finish.
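Applied to the file example from the question, that pattern is roughly this sketch:
myfile = open(r'C:\misc\data')      # set things up
try:
    for line in myfile:             # do something
        print(line)
finally:
    myfile.close()                  # tear things down, even on an exception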

Updating the same instance variables from different processes

Here is a simple scenario:
class Test:
    def __init__(self):
        self.foo = []

    def append(self, x):
        self.foo.append(x)

    def get(self):
        return self.foo

def process_append_queue(append_queue, bar):
    while True:
        x = append_queue.get()
        if x is None:
            break
        bar.append(x)
    print("worker done")

def main():
    import multiprocessing as mp
    bar = Test()
    append_queue = mp.Queue(10)
    append_queue_process = mp.Process(target=process_append_queue, args=(append_queue, bar))
    append_queue_process.start()
    for i in range(100):
        append_queue.put(i)
    append_queue.put(None)
    append_queue_process.join()
    print(str(bar.get()))

if __name__ == "__main__":
    main()
When you call bar.get() at the end of the main() function why does it still return an empty list? How can I make it so that the child process also works with the same instance of Test not a new one?
All answers appreciated!
In general, processes have distinct address spaces, so that mutations of an object in one process have no effect on any object in any other process. Interprocess communication is needed to tell a process about changes made in another process.
That can be done explicitly (using things like multiprocessing.Queue), or implicitly if you use a facility implemented by multiprocessing for this purpose. For example, a great deal of work is done under the covers to make changes to a multiprocessing.Queue visible across processes.
The easiest way in your specific example is to replace your __init__ function like so:
def __init__(self):
    import multiprocessing as mp
    self.foo = mp.Manager().list()
It so happens that an mp.Manager instance supports a list() method that creates a process-aware list object (really a proxy for a list object, which forwards list operations to an under-the-covers server process that maintains a single copy of "the real" list - the list object isn't really shared across processes, because that's impossible - but the proxies make it appear to be shared).
So if you make that change, your code will display the results you expect - and there is no simpler way.
Note that multiprocessing works better the less IPC (interprocess communication) you need, and that's true pretty much regardless of application or programming language.
Objects are copied between processes by pickling them and passing the pickled data over a pipe. There is no way to achieve true "shared memory" for pure Python objects between processes. To achieve precisely this type of synchronization, take a look at the multiprocessing.Manager documentation (https://docs.python.org/2/library/multiprocessing.html#managers), which provides examples of synchronized versions of common Python container types. These are "proxied" containers where operations on the proxy send all arguments across the process boundary, pickled, and are then executed in the manager's server process.
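A minimal, self-contained sketch of that proxied-list behaviour (independent of the question's code):
import multiprocessing as mp

def worker(shared):
    shared.append("from child")      # forwarded to the manager's server process

if __name__ == "__main__":
    with mp.Manager() as manager:
        shared = manager.list()
        p = mp.Process(target=worker, args=(shared,))
        p.start()
        p.join()
        print(list(shared))          # ['from child'] -- the change is visible here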

Python Twisted's DeferredLock

Can someone provide an example and explain when and how to use Twisted's DeferredLock?
I have a DeferredQueue and I think I have a race condition I want to prevent, but I'm unsure how to combine the two.
Use a DeferredLock when you have a critical section that is asynchronous and needs to be protected from overlapping (one might say "concurrent") execution.
Here is an example of such an asynchronous critical section:
class NetworkCounter(object):
    def __init__(self):
        self._count = 0

    def next(self):
        self._count += 1
        recording = self._record(self._count)
        def recorded(ignored):
            return self._count
        recording.addCallback(recorded)
        return recording

    def _record(self, value):
        return http.GET(
            b"http://example.com/record-count?value=%d" % (value,))
See how two concurrent uses of the next method will produce "corrupt" results:
from __future__ import print_function
counter = NetworkCounter()
d1 = counter.next()
d2 = counter.next()
d1.addCallback(print, "d1")
d2.addCallback(print, "d2")
Gives the result:
2 d1
2 d2
This is because the second call to NetworkCounter.next begins before the first call to that method has finished using the _count attribute to produce its result. The two operations share the single attribute and produce incorrect output as a consequence.
Using a DeferredLock instance will solve this problem by preventing the second operation from beginning until the first operation has completed. You can use it like this:
from twisted.internet.defer import DeferredLock

class NetworkCounter(object):
    def __init__(self):
        self._count = 0
        self._lock = DeferredLock()

    def next(self):
        return self._lock.run(self._next)

    def _next(self):
        self._count += 1
        recording = self._record(self._count)
        def recorded(ignored):
            return self._count
        recording.addCallback(recorded)
        return recording

    def _record(self, value):
        return http.GET(
            b"http://example.com/record-count?value=%d" % (value,))
First, notice that the NetworkCounter instance creates its own DeferredLock instance. Each instance of DeferredLock is distinct and operates independently from any other instance. Any code that participates in the use of a critical section needs to use the same DeferredLock instance in order for that critical section to be protected. If two NetworkCounter instances somehow shared state then they would also need to share a DeferredLock instance - not create their own private instance.
Next, see how DeferredLock.run is used to call the new _next method (into which all of the application logic has been moved). Neither NetworkCounter itself nor the application code using NetworkCounter calls the method that contains the critical section. DeferredLock is given responsibility for doing this. This is how DeferredLock can prevent the critical section from being run by multiple operations at the "same" time. Internally, DeferredLock will keep track of whether an operation has started and not yet finished. It can only keep track of operation completion if the operation's completion is represented as a Deferred, though. If you are familiar with Deferreds, you probably already guessed that the (hypothetical) HTTP client API in this example, http.GET, is returning a Deferred that fires when the HTTP request has completed. If you are not familiar with them yet, you should go read about them now.
Once the Deferred that represents the result of the operation fires - in other words, once the operation is done - DeferredLock will consider the critical section "out of use" and allow another operation to begin executing it. It will do this by checking to see if any code has tried to enter the critical section while the critical section was in use and, if so, it will run the function for that operation.
Third, notice that in order to serialize access to the critical section, DeferredLock.run must return a Deferred. If the critical section is in use and DeferredLock.run is called it cannot start another operation. Therefore, instead, it creates and returns a new Deferred. When the critical section goes out of use, the next operation can start and when that operation completes, the Deferred returned by the DeferredLock.run call will get its result. This all ends up looking rather transparent to any users who are already expecting a Deferred - it just means the operation appears to take a little longer to complete (though the truth is that it likely takes the same amount of time to complete but has it wait a while before it starts - the effect on the wall clock is the same though).
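A small, self-contained sketch of DeferredLock.run in isolation (the critical function and the delays here are purely illustrative, not part of the hypothetical http.GET API above):
from twisted.internet import defer, reactor, task

lock = defer.DeferredLock()

def critical(name):
    print(name, "entered the critical section")
    # Simulate asynchronous work; the lock is only released when this
    # Deferred fires, so "second" cannot start until "first" has finished.
    return task.deferLater(reactor, 0.1, print, name, "left the critical section")

lock.run(critical, "first")
lock.run(critical, "second")

reactor.callLater(0.5, reactor.stop)
reactor.run()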
Of course, you can achieve a concurrent-use safe NetworkCounter more easily than all this by simply not sharing state in the first place:
class NetworkCounter(object):
    def __init__(self):
        self._count = 0

    def next(self):
        self._count += 1
        result = self._count
        recording = self._record(self._count)
        def recorded(ignored):
            return result
        recording.addCallback(recorded)
        return recording

    def _record(self, value):
        return http.GET(
            b"http://example.com/record-count?value=%d" % (value,))
This version moves the state used by NetworkCounter.next to produce a meaningful result for the caller out of the instance dictionary (ie, it is no longer an attribute of the NetworkCounter instance) and into the call stack (ie, it is now a closed over variable associated with the actual frame that implements the method call). Since each call creates a new frame and a new closure, concurrent calls are now independent and no locking of any sort is required.
Finally, notice that even though this modified version of NetworkCounter.next still uses self._count which is shared amongst all calls to next on a single NetworkCounter instance this can't cause any problems for the implementation when it is used concurrently. In a cooperative multitasking system such as the one primarily used with Twisted, there are never context switches in the middle of functions or operations. There cannot be a context switch from one operation to another in between the self._count += 1 and result = self._count lines. They will always execute atomically and you don't need locks around them to avoid re-entrancy or concurrency induced corruption.
These last two points - avoiding concurrency bugs by avoiding shared state and the atomicity of code inside a function - combined means that DeferredLock isn't often particularly useful. As a single data point, in the roughly 75 KLOC in my current work project (heavily Twisted based), there are no uses of DeferredLock.

python multiprocessing scheduling task

I have 8 CPU cores and 200 tasks to do. The tasks are isolated; there is no need to wait for or share results. I'm looking for a way to run at most 8 tasks/processes at a time, and when one of them finishes, the next remaining task should start automatically.
How do I know when a child process is done so I can start a new one? First I tried to use Process (multiprocessing), and it was hard to figure out. Then I tried to use Pool and ran into a pickling problem, because I need to instantiate objects dynamically.
Edit: adding my Pool code
class Collectorparallel():
    def fire(self, obj):
        collectorController = Collectorcontroller()
        collectorController.crawlTask(obj)

    def start(self):
        log_to_stderr(logging.DEBUG)
        pluginObjectList = []
        for pluginName in self.settingModel.getAllCollectorName():
            name = pluginName.capitalize()
            # Get plugin class and instantiate object
            module = __import__('plugins.' + pluginName, fromlist=[name])
            pluginClass = getattr(module, name)
            pluginObject = pluginClass()
            pluginObjectList.append(pluginObject)
        pool = Pool(8)
        jobs = pool.map(self.fire, pluginObjectList)
        pool.close()
        print pluginObjectList
pluginObjectList got something like
[<plugins.name1.Name1 instance at 0x1f54290>, <plugins.name2.Name2 instance at 0x1f54f38>]
PicklingError: Can't pickle <type 'instancemethod'>: attribute lookup __builtin__.instancemethod failed
but the Process version works fine.
Warning: this is somewhat subjective to deployment and situation, but my current setup is as follows.
I have a worker program, and I fire up 6 copies (I have 6 cores).
Each worker does the following:
Connects to a Redis instance
Tries to pop some work off a specific list
Pushes back logging information
Either idles or terminates when there is no work left in the 'queue'
Then each program is essentially standalone while still doing the work you require, with a separate queuing system. As you have no go-between on your processes, this might be a solution to your problem. A rough sketch of such a worker is below.
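A rough sketch of one such worker, assuming the redis-py client, a work list called 'work', and a hypothetical do_task function (all names are illustrative only):
import redis

def do_task(payload):
    return b"done: " + payload               # hypothetical task

def worker():
    r = redis.Redis()                         # connect to the Redis instance
    while True:
        item = r.blpop("work", timeout=30)    # pop some work off the list
        if item is None:                      # no work arrived: idle out / terminate
            break
        _key, payload = item
        r.rpush("log", do_task(payload))      # push back logging information

if __name__ == "__main__":
    worker()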
I'm not an expert in multiprocessing in Python, but I tried a few things with the help of http://www.tutorialspoint.com/python/python_multithreading.htm and also http://www.devshed.com/c/a/Python/Basic-Threading-in-Python/1/ .
You can, for example, use the is_alive() method, which answers your question about knowing when a child has finished.
The solution to your problem is trivial. First of all, note that methods cannot be pickled. In fact only the types listed in pickle's documentation can be pickled:
None, True, and False
integers, long integers, floating point numbers, complex numbers
normal and Unicode strings
tuples, lists, sets, and dictionaries containing only picklable objects
functions defined at the top level of a module
built-in functions defined at the top level of a module
classes that are defined at the top level of a module
instances of such classes whose __dict__ or the result of calling __getstate__() is picklable (see section The pickle protocol for details).
[...]
Note that functions (built-in and user-defined) are pickled by
“fully qualified” name reference, not by value. This means that
only the function name is pickled, along with the name of the module the function is defined in. Neither the function’s code, nor any of
its function attributes are pickled. Thus the defining module must be
importable in the unpickling environment, and the module must contain
the named object, otherwise an exception will be raised. [4]
Similarly, classes are pickled by named reference, so the same
restrictions in the unpickling environment apply. Note that none of
the class’s code or data is pickled[...]
Clearly a method isn't a function defined at the top level of a module, hence it cannot be pickled. (Read that part of the documentation carefully to avoid future problems with pickle!) But it is absolutely trivial to replace the method with a global function, passing self as an additional parameter:
import itertools as it

def global_fire(argument):
    self, obj = argument
    self.fire(obj)

class Collectorparallel():
    def fire(self, obj):
        collectorController = Collectorcontroller()
        collectorController.crawlTask(obj)

    def start(self):
        log_to_stderr(logging.DEBUG)
        pluginObjectList = []
        for pluginName in self.settingModel.getAllCollectorName():
            name = pluginName.capitalize()
            # Get plugin class and instantiate object
            module = __import__('plugins.' + pluginName, fromlist=[name])
            pluginClass = getattr(module, name)
            pluginObject = pluginClass()
            pluginObjectList.append(pluginObject)
        pool = Pool(8)
        jobs = pool.map(global_fire, zip(it.repeat(self), pluginObjectList))
        pool.close()
        print pluginObjectList
Note that, since Pool.map calls the given function with only one argument, we have to "pack together" both self and the actual argument. To do this I have zipped it.repeat(self) and the original iterable.
If you do not care about the order in which the calls are done, then using pool.imap_unordered might provide better performance. However, it returns an iterable and not a list, so if you want the list of results you'll have to do jobs = list(pool.imap_unordered(...)).
I believe that this code will remove all pickling problems.
class Collectorparallel():
    def __call__(self, pluginName):
        # Get plugin class and instantiate object inside the worker process
        name = pluginName.capitalize()
        module = __import__('plugins.' + pluginName, fromlist=[name])
        pluginClass = getattr(module, name)
        pluginObject = pluginClass()
        collectorController = Collectorcontroller()
        collectorController.crawlTask(pluginObject)

    def start(self):
        log_to_stderr(logging.DEBUG)
        pool = Pool(8)
        jobs = pool.map(self, self.settingModel.getAllCollectorName())
        pool.close()
What has happened here is that Collectorparallel has been turned into a callable. The list of plugin names is used as the iterable for the pool, the actual determination of the plugins and their instantiation is done in each of the worker processes, and the class instance object is used as the callable for each worker process.

A thread-safe memoize decorator

I'm trying to make a memoize decorator that works with multiple threads.
I understood that I need to use the cache as a shared object between the threads, and acquire/lock the shared object. I'm of course launching the threads:
for i in range(5):
    thread = threading.Thread(target=self.worker, args=(self.call_queue,))
    thread.daemon = True
    thread.start()
where worker is:
def worker(self, call):
    func, args, kwargs = call.get()
    self.returns.put(func(*args, **kwargs))
    call.task_done()
The problem starts, of course, when I'm sending a function decorated with a memo function (like this) to many threads at the same time.
How can I implement the memo's cache as a shared object among threads?
The most straightforward way is to employ a single lock for the entire cache, and require that any writes to the cache grab the lock first.
In the example code you posted, at line 31, you would acquire the lock and check to see if the result is still missing, in which case you would go ahead and compute and cache the result. Something like this:
lock = threading.Lock()
...
except KeyError:
    with lock:
        if key in self.cache:
            v = self.cache[key]
        else:
            v = self.cache[key] = f(*args, **kwargs), time.time()
The example you posted stores a cache per function in a dictionary, so you'd need to store a lock per function as well.
If you were using this code in a highly contended environment, though, it would probably be unacceptably inefficient, since threads would have to wait on each other even if they weren't calculating the same thing. You could probably improve on this by storing a lock per key in your cache. You'll need to globally lock access to the lock storage as well, though, or else there's a race condition in creating the per-key locks.
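For reference, a minimal self-contained sketch of the single-lock approach described above (not the linked decorator itself, and without the per-key refinement):
import functools
import threading

def memoize(f):
    cache = {}
    lock = threading.Lock()              # one lock guards the whole cache

    @functools.wraps(f)
    def wrapper(*args):
        with lock:
            if args not in cache:        # compute under the lock so no two
                cache[args] = f(*args)   # threads duplicate the work
            return cache[args]
    return wrapper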
