This is a newbie question:
A class is an object, so I can create a class called pippo() and add functions and parameters inside it. What I don't understand is whether the functions inside pippo are executed from top to bottom when I assign x = pippo(), or whether I must call them explicitly as x.dosomething() outside of pippo.
Working with Python's multiprocessing package, is it better to define a big function and create the object using the target argument in the call to Process(), or to create your own process class by inheriting from the Process class?
I often wondered why Python's doc page on multiprocessing only shows the "functional" approach (using the target parameter). Probably because terse, succinct code snippets are best for illustration purposes. For small tasks that fit in one function, I can see how that is the preferred way, à la:
from multiprocessing import Process

def f():
    print('hello')

p = Process(target=f)
p.start()
p.join()
But when you need greater code organization (for complex tasks), making your own class is the way to go:
from multiprocessing import Process

class P(Process):
    def __init__(self):
        super(P, self).__init__()

    def run(self):
        print('hello')

p = P()
p.start()
p.join()
Bear in mind that each spawned process is initialized with a copy of the memory footprint of the master process. And that the constructor code (i.e. stuff inside __init__()) is executed in the master process -- only code inside run() executes in separate processes.
Therefore, if a process (master or spawned) changes one of its member variables, the change will not be reflected in the other processes. This, of course, is only true for built-in types, like bool, string, list, etc. You can however import "special" data structures from the multiprocessing module which are then transparently shared between processes (see Sharing state between processes), or you can create your own channels of IPC (inter-process communication) such as multiprocessing.Pipe and multiprocessing.Queue.
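For illustration, a minimal sketch of the shared-state approach using multiprocessing.Value (the counter attribute below is just an example name, not something from the docs):

from multiprocessing import Process, Value

class P(Process):
    def __init__(self, counter):
        super(P, self).__init__()
        self.counter = counter          # a shared ctypes value, not a plain int

    def run(self):
        # runs in the child process; updates are visible to the parent
        with self.counter.get_lock():
            self.counter.value += 1

if __name__ == '__main__':
    counter = Value('i', 0)             # 'i' = C int, initial value 0
    p = P(counter)
    p.start()
    p.join()
    print(counter.value)                # 1 -- unlike a plain attribute, the change is seen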
I'm working on developing a little IRC client in Python (ver. 2.7). I had hoped to use multiprocessing to read from all the servers I'm currently connected to, but I'm running into an issue.
import socket
import multiprocessing as mp
import types
import copy_reg
import pickle

def _pickle_method(method):
    func_name = method.im_func.__name__
    obj = method.im_self
    cls = method.im_class
    return _unpickle_method, (func_name, obj, cls)

def _unpickle_method(func_name, obj, cls):
    for cls in cls.mro():
        try:
            func = cls.__dict__[func_name]
        except KeyError:
            pass
        else:
            break
    return func.__get__(obj, cls)

copy_reg.pickle(types.MethodType, _pickle_method, _unpickle_method)

class a(object):
    def __init__(self):
        sock1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock1.connect((socket.gethostbyname("example.com"), 6667))
        self.servers = {}
        self.servers["example.com"] = sock1

    def method(self, hostname):
        self.servers[hostname].send("JOIN DAN\r\n")
        print "1"

    def oth_method(self):
        pool = mp.Pool()
        ## pickle.dumps(self.method)
        pool.map(self.method, self.servers.keys())
        pool.close()
        pool.join()

if __name__ == "__main__":
    b = a()
    b.oth_method()
Whenever it hits the line pool.map(self.method, self.servers.keys()) I get the error
TypeError: expected string or Unicode object, NoneType found
From what I've read, this is what happens when I try to pickle something that isn't picklable. To resolve this I first added the _pickle_method and _unpickle_method described here. Then I realized that I was (originally) trying to pass pool.map() a list of sockets (which are definitely not picklable), so I changed it to the list of hostnames, since strings can be pickled. I still get this error, however.
I then tried calling pickle.dumps() directly on self.method, self.servers.keys(), and self.servers.keys()[0]. As expected it worked fine for the latter two, but from the first I get
TypeError: a class that defines __slots__ without defining __getstate__ cannot be pickled.
Some more research led me to this question, which seems to indicate that the issue is the use of sockets (and gnibbler's answer to that question would seem to confirm it).
Is there a way that I can actually use multiprocessing for this? From what I've (very briefly) read pathos.multiprocessing might be what I need, but I'd really like to stick to the standard library if at all possible.
I'm also not set on using multiprocessing - if multithreading would work better and avoid this issue then I'm more than open to those solutions.
Your root problem is that you can't pass sockets to child processes. The easy solution is to use threads instead.
In more detail:
Pickling a bound method requires pickling three things: the function name, the object, and the class. (I think multiprocessing does this for you automatically, but you're doing it manually; that's fine.) To pickle the object, you have to pickle its members, which in your case includes a dict whose values are sockets.
You can't pickle sockets with the default pickling protocol in Python 2.x. The answer to the question you linked explains why, and provides the simple workaround: don't use the default pickling protocol. But there's an additional problem with socket; it's just a wrapper around a type defined in a C extension module, which has its own problems with pickling. You might be able to work around that as well…
But that still isn't going to help. Under the covers, that C extension class is itself just a wrapper around a file descriptor. A file descriptor is just a number. Your operating system keeps a mapping of file descriptors to open sockets (and files and pipes and so on) for each process; file #4 in one process isn't file #4 in another process. So, you need to actually migrate the socket's file descriptor to the child at the OS level. This is not a simple thing to do, and it's different on every platform. And, of course, on top of migrating the file descriptor, you'll also have to pass enough information to re-construct the socket object. All of this is doable; there might even be a library that wraps it up for you. But it's not easy.
One alternate possibility is to open all of the sockets before launching any of the children, and set them to be inherited by the children. But, even if you could redesign your code to do things that way, this only works on POSIX systems, not on Windows.
A much simpler possibility is to just use threads instead of processes. If you're doing CPU-bound work, threads have problems in Python (well, CPython, the implementation you're almost certainly using) because there's a global interpreter lock that prevents two threads from interpreting code at the same time. But when your threads spend all their time waiting on socket.recv and similar I/O calls, there is no problem using threads. And they avoid all the overhead and complexity of pickling data and migrating sockets and so forth.
You may notice that the threading module doesn't have a nice Pool class like multiprocessing does. Surprisingly, however, there is a thread pool class in the stdlib—it's just in multiprocessing. You can access it as multiprocessing.dummy.Pool.
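For illustration, a rough sketch of how the thread pool could slot into the class from the question (only oth_method changes; this is an untested assumption about the wiring, not a drop-in fix):

from multiprocessing.dummy import Pool   # thread-based Pool, same API as multiprocessing.Pool

class a(object):
    # __init__ and method exactly as in the question ...

    def oth_method(self):
        pool = Pool()
        # worker threads share this process's memory, so neither self nor
        # the sockets ever get pickled -- the original TypeError goes away
        pool.map(self.method, self.servers.keys())
        pool.close()
        pool.join()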
If you're willing to go beyond the stdlib, the concurrent.futures module from Python 3 has a backport named futures that you can install off PyPI. It includes a ThreadPoolExecutor which is a slightly higher-level abstraction around a pool which may be simpler to use. But Pool should also work fine for you here, and you've already written the code.
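And a similar sketch using the backport (pip install futures on Python 2); again only the pool-using method is shown, and the rest of the class is assumed unchanged:

from concurrent.futures import ThreadPoolExecutor   # from the "futures" backport on Python 2

class a(object):
    # __init__ and method exactly as in the question ...

    def oth_method(self):
        with ThreadPoolExecutor(max_workers=4) as executor:
            # executor.map is lazy; list() forces every call to complete
            list(executor.map(self.method, self.servers.keys()))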
If you do want to try jumping out of the standard library, then the following code for pathos.multiprocessing (as you mention) should not throw pickling errors, as the dill serializer knows how to serialize sockets and file handles.
>>> import socket
>>> import pathos.multiprocessing as mp
>>> import types
>>> import dill as pickle
>>>
>>> class a(object):
...     def __init__(self):
...         sock1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
...         sock1.connect((socket.gethostbyname("example.com"), 6667))
...         self.servers = {}
...         self.servers["example.com"] = sock1
...     def method(self, hostname):
...         self.servers[hostname].send("JOIN DAN\r\n")
...         print "1"
...     def oth_method(self):
...         pool = mp.ProcessingPool()
...         pool.map(self.method, self.servers.keys())
...         pool.close()
...         pool.join()
...
>>> b = a()
>>> b.oth_method()
One issue, however, is that multiprocessing requires serialization, and in many cases a socket will serialize in such a way that the deserialized socket is closed. The reason is primarily that the file descriptor isn't copied as expected; it's copied by reference. With dill you can customize the serialization of file handles so that the content is transferred rather than a reference… however, this doesn't translate well to a socket (at least at the moment).
I'm the dill and pathos author, and I'd have to agree with @abarnert that you probably don't want to do this with multiprocessing (at least not by storing a map of servers and sockets). If you want to use multiprocessing's threading interface and you run into any serialization concerns, pathos.multiprocessing does have mp.ThreadingPool() instead of mp.ProcessingPool(), so you can access a wrapper around multiprocessing.dummy.Pool but still get the additional features that pathos provides (such as multi-argument pools for blocking or asynchronous pipes and maps, etc.).
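For example, the session above could switch to threads with only the pool line changing (a sketch based on the ThreadingPool mentioned here; b is the instance from the session above):

>>> pool = mp.ThreadingPool()
>>> pool.map(b.method, b.servers.keys())
>>> pool.close()
>>> pool.join()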
I am trying to better understand Python's modules, coming mostly from a C background.
I have main.py with the following:
def g():
    print obj  # Need access to the object below

if __name__ == "__main__":
    obj = {}
    import child
    child.f()
And child.py:
def f():
    import main
    main.g()
This particular structure of code may seem strange at first, but rest assured this is stripped from a larger project I am working on, where delegation of responsibility and decoupling forces the kind of inter-module function call sequence you see.
I need to be able to access the actual object I create when first executing main (python main.py). Is this possible without explicitly passing obj around as a parameter? I will have other variables too, and I don't want to pass all of those as well. If desperate, I can create a "state" object for the entire main module and pass that around, but even that is a last resort for me. In C this would be global variables at their simplest, but in Python it seems to be a different beast (module-level globals only?).
One of the solutions (excluding parameter passing, at least) revolves around the fact that when the main module is executed as a script -- e.g. via python main.py, so that the if clause succeeds and obj is bound -- the running module and its state are registered in sys.modules under the name __main__. So when the child module needs the actual instance of the main module, it is not main it should import but __main__; otherwise two distinct module copies would exist, each with its own state.
'Fixed' child.py:
def f():
    import __main__
    __main__.g()
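A quick way to see why importing main directly goes wrong (a throwaway check, not something to keep in the project): when main.py runs as a script, import main builds a second, separate module object next to __main__:

def f():
    import __main__
    import main                    # re-executes main.py as a module named "main"
    print(main is __main__)        # False: two distinct module objects
    print(hasattr(main, "obj"))    # False: the "main" copy never entered the
                                   #        __main__ block, so obj was never bound
    __main__.g()                   # works: this is the running script's namespace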
With the following code, it seems that the queue instance passed to the worker isn't initialized:
from multiprocessing import Process
from multiprocessing.queues import Queue

class MyQueue(Queue):
    def __init__(self, name):
        Queue.__init__(self)
        self.name = name

def worker(queue):
    print queue.name

if __name__ == "__main__":
    queue = MyQueue("My Queue")
    p = Process(target=worker, args=(queue,))
    p.start()
    p.join()
This throws:
... line 14, in worker
print queue.name
AttributeError: 'MyQueue' object has no attribute 'name'
I can't re-initialize the queue, because I'd lose the original value of queue.name. Passing the queue's name as a separate argument to the worker should work, but it's not a clean solution.
So, how can I inherit from multiprocessing.queues.Queue without getting this error?
On POSIX, Queue objects are shared to the child processes by simple inheritance.*
On Windows, that isn't possible, so it has to pickle the Queue, send it over a pipe to the child, and unpickle it.
(This may not be obvious, because if you actually try to pickle a Queue, you get an exception, RuntimeError: MyQueue objects should only be shared between processes through inheritance. If you look through the source, you'll see that this is really a lie: it only raises this exception if you try to pickle a Queue when multiprocessing is not in the middle of spawning a child process.)
Of course generic pickling and unpickling wouldn't do any good, because you'd end up with two identical queues, not the same queue in two processes. So, multiprocessing extends things a bit, by adding a register_after_fork mechanism for objects to use when unpickling.** If you look at the source for Queue, you can see how it works.
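If you're curious about the mechanism itself, here is a minimal sketch of register_after_fork (the class and attribute names are made up, and the 'child' result only applies when the child is created by fork):

from multiprocessing import Process
from multiprocessing.util import register_after_fork

class Tagged(object):
    def __init__(self):
        self.where = 'parent'
        # after a fork, the child runs Tagged._mark(self) on its copy
        register_after_fork(self, Tagged._mark)

    def _mark(self):
        self.where = 'child'

def worker(t):
    print(t.where)    # 'child' when the process was created by fork

if __name__ == '__main__':
    t = Tagged()
    p = Process(target=worker, args=(t,))
    p.start()
    p.join()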
But you don't really need to know how it works to hook it; you can hook it the same way as any other class's pickling. For example, this should work:***
def __getstate__(self):
    return self.name, super(MyQueue, self).__getstate__()

def __setstate__(self, state):
    self.name, state = state
    super(MyQueue, self).__setstate__(state)
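Put together with the class from the question, the whole subclass would look roughly like this (assuming the two hooks drop straight in as methods):

from multiprocessing.queues import Queue

class MyQueue(Queue):
    def __init__(self, name):
        Queue.__init__(self)
        self.name = name

    def __getstate__(self):
        # bundle our extra attribute with Queue's own pickling state
        return self.name, super(MyQueue, self).__getstate__()

    def __setstate__(self, state):
        # restore our attribute, then let Queue rebuild itself from its state
        self.name, state = state
        super(MyQueue, self).__setstate__(state)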
For more details, the pickle documentation explains the different ways you can influence how your class is pickled.
(If it doesn't work, and I haven't made a stupid mistake… then you do have to know at least a little about how it works to hook it… but most likely just to figure out whether to do your extra work before or after the _after_fork(), which would just require swapping the last two lines…)
* I'm not sure it's actually guaranteed to use simple fork inheritance on POSIX platforms. That happens to be true on 2.7 and 3.3. But there's a fork of multiprocessing that uses the Windows-style pickle-everything on all platforms for consistency, and another one that uses a hybrid on OS X to allow using CoreFoundation in single-threaded mode, or something like that, and it's clearly doable that way.
** Actually, I think Queue is only using register_after_fork for convenience, and could be rewritten without it… but it's depending on the magic that Pipe does in its _after_fork on Windows, or Lock and BoundedSemaphore on POSIX.
*** This is only correct because I happen to know, from reading the source, that Queue is a new-style class, doesn't override __reduce__ or __reduce_ex__, and never returns a falsey value from __getstate__. If you didn't know that, you'd have to write more code.
This is going to be a very long question, so pardon me.
I have the following scenario; I think it will be easier to explain things with pseudo code.
A Python file, say test.py:
def test(i):
    from rpy2.robjects import r
    r.source('r_file.R')
    r.call_function(with some arguments)
    #Some Operations
    del r
File: r_file.R
rm(list=ls(all=TRUE))
#some global variables
#some reference class
#creating an object of reference class

call_function = function(some arguments)
{
    Do some processing
    call few methods on a reference class
    call some more methods and do some operations
    rm(list=ls(all=TRUE))
    gc()
    return(0)
}
The function test gets called from main for several values of i, i.e. it is always invoked more than once. Hence, we source the R file more than once. I wanted a new R interpreter every time the Python function is invoked; therefore, I import r on every call and also delete the rpy2 object.
Within the R function call_function, I invoke some methods, which in turn create reference class objects.
Within the R code, I use rm at the beginning of the file and also when call_function exits.
Given this background, the problem I'm facing now is that rm does not remove any of the reference classes in the code, and I keep getting a warning like this:
In .removePreviousCoerce(class1, class2, where, prevIs) :
methods currently exist for coercing from "Rev_R5" to "envRefClass"; they will be replaced
Here, Rev_R5 is a reference class. I do not want this to happen; is there a way to remove all the methods and objects related to the reference classes using rm?
Removing all objects from R's global environment does not mean that you are back to a freshly started R process (class and method definitions may remain, as you discovered).
R functions such as removeClass(), removeMethod(), or removeGeneric() could be considered, but unless there are objective requirements to do so (like avoiding the loading of large objects over and over again), creating new R processes each time might just be the safest way to go (starting an R process is relatively fast).
Since it is not possible to terminate and restart an embedded R (a limitation coming from R, not rpy2), you'll have to start and stop Python processes embedding R.
One way to do so is to use the Python package multiprocessing (included in Python's standard library). An added bonus is that the processes can be run in parallel.
Simple example using Doug Hellmann's excellent tutorial as a base:
import multiprocessing

def R_worker(i):
    """worker function"""
    print('Worker %i started' % i)
    from rpy2.robjects import r
    r.source('r_file.R')
    r.call_function(with some arguments)
    #Some Operations
    del r
    return

if __name__ == '__main__':
    jobs = []
    for i in range(5):
        p = multiprocessing.Process(target=R_worker, args=(i,))
        jobs.append(p)
        p.start()
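If the parent needs to wait for all of the R workers before moving on, a small addition at the end of the __main__ block would do it (not part of the original snippet):

    for job in jobs:
        job.join()    # block until each R worker process has exited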
I would like to create a Pyramid app with an ORM which I am writing (currently in deep alpha status). I want to plug the ORM into the app sanely, and thus I want to know how global objects are handled in multithreading.
In the file:
https://www.megiforge.pl/p/elephantoplasty/source/tree/0.0.1/src/eplasty/ctx.py
you can see there is a global object called ctx which contains a default session. What if I run set_context() and start_session() in middleware at ingress? Can I then expect to have a separate session in ctx in every thread? Or is there a risk that two threads will use the same session?
Global variables are shared between all threads, so if you run those functions the threads will conflict with each other in unpredictable ways.
To do what you want you can use thread-local data via threading.local. Remove the global definition of ctx, create a single module-level threading.local() instance, and add the following function.
import threading

# One module-level threading.local(); its attributes are separate per thread.
thread_data = threading.local()

def get_ctx():
    if not hasattr(thread_data, "ctx"):
        thread_data.ctx = Ctx()
    return thread_data.ctx
Then, everywhere you reference ctx call get_ctx() instead. This will ensure that your context is not shared between threads.
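As a quick sanity check (a throwaway snippet; the Ctx class here is just a stand-in for your real context class), two threads calling get_ctx() get distinct objects, while repeated calls within one thread return the same one:

import threading

class Ctx(object):      # stand-in for the real context class
    pass

thread_data = threading.local()

def get_ctx():
    if not hasattr(thread_data, "ctx"):
        thread_data.ctx = Ctx()
    return thread_data.ctx

def show():
    # id() is stable within a thread, but differs between threads
    print("%s: ctx id %d (same object on repeat call: %s)" % (
        threading.current_thread().name, id(get_ctx()), get_ctx() is get_ctx()))

threads = [threading.Thread(target=show) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
show()    # the main thread gets its own, third context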