I know that child processes won't see changes made after a fork/spawn, and that on Windows child processes don't inherit globals unless they use shared memory. But what I have is a situation where the children can't see changes to a global variable in shared memory made before the fork/spawn.
Simple demonstration:
from multiprocessing import Process, Value

global foo
foo = Value('i', 1)

def printfoo():
    global foo
    with foo.get_lock():
        print(foo.value)

if __name__ == '__main__':
    with foo.get_lock():
        foo.value = 2
    Process(target=printfoo).start()
On Linux and MacOS, this displays the expected 2. On Windows, it displays 1, even though the modification to the global Value is made before the call to Process. How can I make the change visible to the child process on Windows, too?
The problem here is that your child process creates a new shared value, rather than using the one the parent created. Your parent process needs to explicitly send the Value to the child, for example, as an argument to the target function:
from multiprocessing import Process, Value

def use_shared_value(val):
    val.value = 2

if __name__ == '__main__':
    val = Value('i', 1)
    p = Process(target=use_shared_value, args=(val,))
    p.start()
    p.join()
    print(val.value)
(Unfortunately, I don't have a Windows Python install to test this on.)
Child processes cannot inherit globals on Windows, regardless of whether those globals are initialized to multiprocessing.Value instances. multiprocessing.Value does not change the fact that the child re-executes your file, and re-executing the Value construction doesn't use the shared resources the parent allocated.
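To make that concrete, here is a small sketch based on the explanation above (illustrative names, untested on Windows): under spawn, a Value passed explicitly as an argument reflects the parent's change, while the module-level Value the child re-creates does not.

from multiprocessing import Process, Value

# Module-level Value: under spawn the child re-executes this line and
# allocates a brand-new shared segment, separate from the parent's.
foo = Value('i', 1)

def show(parent_foo):
    print('passed in :', parent_foo.value)  # the parent's Value, shows 2
    print('module var:', foo.value)         # re-created in the child, still 1

if __name__ == '__main__':
    foo.value = 2
    Process(target=show, args=(foo,)).start()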
Related
Given an instance method that mutates an instance variable, running that method in a ProcessPoolExecutor does execute the method, but does not mutate the instance variable.
from concurrent.futures import ProcessPoolExecutor

class A:
    def __init__(self):
        self.started = False

    def method(self):
        print("Started...")
        self.started = True

if __name__ == "__main__":
    a = A()
    with ProcessPoolExecutor() as executor:
        executor.submit(a.method)
    assert a.started

Output:

Started...
Traceback (most recent call last):
  File "/path/to/file", line 19, in <module>
    assert a.started
AssertionError
Are only pure functions allowed in ProcessPoolExecutor?
For Windows
Multiprocessing does not share its state with child processes on Windows systems. This is because the default way to start child processes on Windows is spawn. From the documentation for the spawn start method:
The parent process starts a fresh python interpreter process. The child process will only inherit those resources necessary to run the process object’s run() method. In particular, unnecessary file descriptors and handles from the parent process will not be inherited. Starting a process using this method is rather slow compared to using fork or forkserver
Therefore, when you pass any objects to child processes, they are actually copied, and do not have the same memory address as in the parent process. A simple way to demonstrate this through your example would be to print the objects in the child process and the parent process:
from concurrent.futures import ProcessPoolExecutor

class A:
    def __init__(self):
        self.started = False

    def method(self):
        print("Started...")
        print(f'Child proc: {self}')
        self.started = True

if __name__ == "__main__":
    a = A()
    print(f'Parent proc: {a}')
    with ProcessPoolExecutor() as executor:
        executor.submit(a.method)
Output
Parent proc: <__main__.A object at 0x0000028F44B40FD0>
Started...
Child proc: <__mp_main__.A object at 0x0000019D2B8E64C0>
As you can see, the two objects reside at different places in memory. Altering one does not affect the other at all. This is the reason why you don't see any changes to a.started in the parent process.
Once you understand this, your question becomes how to share the same object, rather than copies, with the child processes. There are numerous ways to go about this, and questions on how to share complex objects like a have already been asked and answered on Stack Overflow.
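As one illustration (a sketch of just one of those ways, with names of my choosing): a multiprocessing.Manager can hold the shared state, and every process that receives the proxy talks to the same underlying object.

from concurrent.futures import ProcessPoolExecutor
from multiprocessing import Manager

def start_task(shared):
    # Mutate the managed dict; the proxy forwards the change to the manager process.
    shared['started'] = True

if __name__ == "__main__":
    with Manager() as manager:
        shared = manager.dict({'started': False})
        with ProcessPoolExecutor() as executor:
            executor.submit(start_task, shared).result()
        assert shared['started']  # passes: parent and child see the same state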
For UNIX
Much the same can be said for the other methods of starting new processes that UNIX-based systems have the option of using (I am not sure what the default is for concurrent.futures on OSX). For example, the multiprocessing documentation explains fork as
The parent process uses os.fork() to fork the Python interpreter. The child process, when it begins, is effectively identical to the parent process. All resources of the parent are inherited by the child process. Note that safely forking a multithreaded process is problematic.
So fork creates child processes that share the entire memory space of the parent process at start. However, it does so using copy-on-write. This means that if you modify any inherited object from within the child process, the affected memory is duplicated so that the parent process is not disturbed, and the modified object becomes local to the child process (much like what spawn does at start).
Hence the answer still stands: if you plan to modify the objects passed to the child process, or if you are not on a UNIX system, you will need to share the objects yourself so that they point to the same memory.
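Here is a small sketch of that point, assuming a fork-capable platform: a plain Python object mutated in the child stays unchanged in the parent, while a multiprocessing.Value does propagate because it lives in an explicitly shared segment.

from multiprocessing import Process, Value, set_start_method

class Plain:
    def __init__(self):
        self.flag = False

def work(plain, shared):
    plain.flag = True    # written to the child's copy-on-write pages only
    shared.value = 1     # written to the shared memory segment

if __name__ == '__main__':
    set_start_method('fork')  # assumes a UNIX-like platform
    plain, shared = Plain(), Value('i', 0)
    p = Process(target=work, args=(plain, shared))
    p.start()
    p.join()
    print(plain.flag, shared.value)  # False 1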
Further reading on start methods.
I'm trying to figure out how Lock works under the hood. I run this code on macOS, which uses "spawn" as the default method to start new processes.
from multiprocessing import Process, Lock, set_start_method
from time import sleep

def f(lock, i):
    lock.acquire()
    print(id(lock))
    try:
        print('hello world', i)
        sleep(3)
    finally:
        lock.release()

if __name__ == '__main__':
    # set_start_method("fork")
    lock = Lock()
    for num in range(3):
        p = Process(target=f, args=(lock, num))
        p.start()
        p.join()
Output:
140580736370432
hello world 0
140251759281920
hello world 1
140398066042624
hello world 2
The Lock works in my code. However, the ids of the lock confuse me. Since the ids are different, are they still the same lock, or are there multiple locks that somehow communicate secretly? Does id() still hold its meaning under multiprocessing? I quote: "CPython implementation detail: id is the address of the object in memory."
If I use the "fork" method, set_start_method("fork"), it prints identical ids, which makes total sense to me.
id is implemented as, but not required to be, the memory location of the given object. When using fork, the child process does not get its own copy of the memory until it modifies something (copy-on-write), so the memory location does not change because it "is" the same object.

When using spawn, an entirely new process is created and the __main__ file is imported as a module into the new process's namespace, so all of your functions, classes, and module-level variables are accessible (minus any modifications made inside if __name__ == "__main__":). Python then creates a connection between the processes (a pipe) over which it can send which function to call and the arguments to call it with. Everything passing through this pipe must be pickled and then unpickled.

Locks specifically are re-created when unpickling by asking the operating system for a lock with a specific name (the name was generated in the parent process when the lock was created, and is sent across as part of the pickle). This is how the two locks are synchronized: they are backed by an object the operating system controls. Python then stores this lock, along with some other data (the PyObject, as it were), in the memory of the new process. Calling id now gives the location of this struct, which is different because it was created by a different process in a different chunk of memory.
Here's a quick example to convince you that a "spawned" lock is still synchronized:
from multiprocessing import Process, Lock, set_start_method

def foo(lock):
    with lock:
        print(f'child process lock id: {id(lock)}')

if __name__ == "__main__":
    set_start_method("spawn")
    lock = Lock()
    print(f'parent process lock id: {id(lock)}')
    lock.acquire()  # lock the lock so the child has to wait
    p = Process(target=foo, args=(lock,))
    p.start()
    input('press enter to unlock the lock')
    lock.release()
    p.join()
The different "id's" are the different PyObject locations, but have little to do with the underlying mutex. I am not aware that there's a direct way to inspect the underlying lock which the operating system manages.
I want to use Python's multiprocessing module to make effective use of multiple CPUs to speed up my processing.
All seems to work, however I want to run Pool.map(f, [item, item]) from within a class, in a sub module somewhere deep in my program. The reason is that the program has to prepare the data first and wait for certain events to happen before there is anything to be processed.
The multiprocessing docs say you can only run it from within an if __name__ == '__main__': block. I don't understand the significance of that and tried it anyway, like so:
from multiprocessing import Pool

class Foo(object):
    n = 1000000

    def __init__(self, x):
        self.x = x + 1
        pass

    def run(self):
        for i in range(1, self.n):
            self.x *= 1.0 * i / self.x
        return self

class Bar(object):
    def __init__(self):
        pass

    def go_all(self):
        work = [Foo(i) for i in range(960)]

        def do(obj):
            return obj.run()

        p = Pool(16)
        finished_work = p.map(do, work)
        return

bar = Bar()
bar.go_all()
It indeed doesn't work! I get the following error:
PicklingError: Can't pickle <type 'function'>: attribute lookup
__builtin__.function failed
I don't quite understand why, as everything seems to be perfectly pickleable. I have the following questions:
Can this be made to work without putting the p.map line in my main program?
If not, can "main" programs be called as sub-routines/modules, such to make it still work?
Is there some handy trick to loop back from a submodule to the main program and run it from there?
I'm on Linux and Python 2.7
I believe you misunderstood the documentation. What the documentation says is to do this:
if __name__ == '__main__':
    bar = Bar()
    bar.go_all()
So your p.map line does not need to be inside your "main function", or whatever. Only the code that actually spawns the subprocesses has to be "guarded". This is unavoidable due to limitations of the Windows OS.
Moreover, the function that you pass to Pool.map has to be importable: functions are pickled simply by their names, and the interpreter then has to be able to import them to rebuild the function object when they are passed to the subprocess. So you should move your do function to the module (global) level to avoid pickling errors.
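A sketch of that rearrangement, keeping the question's Foo and Bar: do moves to the module level so it can be pickled by name, and the code that kicks everything off sits under the guard.

from multiprocessing import Pool

class Foo(object):
    n = 1000000

    def __init__(self, x):
        self.x = x + 1

    def run(self):
        for i in range(1, self.n):
            self.x *= 1.0 * i / self.x
        return self

def do(obj):               # module level: pickled by name, importable by the workers
    return obj.run()

class Bar(object):
    def go_all(self):
        work = [Foo(i) for i in range(960)]
        p = Pool(16)
        finished_work = p.map(do, work)
        p.close()
        p.join()
        return finished_work

if __name__ == '__main__':  # only the entry point that spawns processes needs the guard
    bar = Bar()
    bar.go_all()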
The extra restrictions on the multiprocessing module on ms-windows stem from the fact that it doesn't have the fork system call. On UNIX-like operating systems, fork makes a perfect copy of a process and continues to run it alongside the parent process. The only difference between them is that fork returns a different value in the parent and the child process.
On ms-windows, multiprocessing needs to start a new Python instance using a native method to start processes. Then it needs to bring that Python instance into the same state as the "parent" process.
This means (among other things) that the Python code must be importable without side effects like trying to start yet another process. Hence the use of the if __name__ == '__main__' guard.
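To make "importable without side effects" concrete, here is a minimal layout (names are illustrative): everything the child needs is defined at import time, and the code that actually starts processes only runs in the parent.

from multiprocessing import Process

def worker(msg):
    # Defined at module level, so the child's re-import of this module can find it.
    print('child says:', msg)

if __name__ == '__main__':
    # Runs only when executed as a script, not when the child re-imports the
    # module under spawn; without the guard, the child would try to start yet
    # another child, and so on.
    p = Process(target=worker, args=('hello',))
    p.start()
    p.join()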
I have a class variable declared as a list that I want to update from a method declared within that class. However, since this method processes a large amount of data, I am using multiprocessing to invoke it, and hence I need to put a lock on the class variable before updating it. I am unable to figure out how to put such a lock in place and update the class variable. If it matters, I am only creating one object of the said class at any given time.
Because of Python's GIL, getting true parallelism means using multiprocessing rather than threads, and separate processes do not share memory by default.
But you can still share state between them by using multiprocessing's shared Value/Array:
from https://docs.python.org/2/library/multiprocessing.html#sharing-state-between-processes
from multiprocessing import Process, Value, Array

def f(n, a):
    n.value = 3.1415927
    for i in range(len(a)):
        a[i] = -a[i]

if __name__ == '__main__':
    num = Value('d', 0.0)
    arr = Array('i', range(10))

    p = Process(target=f, args=(num, arr))
    p.start()
    p.join()

    print num.value
    print arr[:]
Now, as you asked, you need to ensure that different processes won't access the same variable at the same time, and use a Lock. Fortunately, all the shared variables available in the multiprocessing module are paired with a Lock.
To access the lock:
num.acquire() # get the lock
# do stuff
num.release() # don't forget to release it
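Equivalently, and a bit safer against forgetting the release, the lock paired with a Value or Array can be obtained with get_lock() and used as a context manager; a small sketch:

from multiprocessing import Process, Value

def increment(counter):
    # get_lock() returns the Lock paired with the shared value; the with-block
    # releases it even if an exception is raised.
    with counter.get_lock():
        counter.value += 1

if __name__ == '__main__':
    counter = Value('i', 0)
    procs = [Process(target=increment, args=(counter,)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(counter.value)  # 4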
I hope this helps.
If you're using the multiprocessing module (as opposed to multithreading, which is different), then unless I'm mistaken, the multiple processes forked don't share memory and each process would have its own copy of your class. This would mean that a lock would not be necessary, but it would also mean that the class attribute is not shared like you want it to be.
The multiprocessing module does offer several ways to allow communication between processes, including shared array objects. Perhaps this is what you're looking for.
Depending on what you're doing, you might also consider using the master-worker pattern, where you create a worker class with methods to manipulate your data, spawn several processes to run this class, and then dispatch datasets to the workers from your main process using the Queue class from the multiprocessing module.
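A rough sketch of that pattern, assuming the work items and results are picklable (the function names and the squaring stand-in are illustrative):

from multiprocessing import Process, Queue

def worker(tasks, results):
    # Pull items off the task queue until the master sends a stop marker (None).
    for item in iter(tasks.get, None):
        results.put(item * item)  # stand-in for real data processing

if __name__ == '__main__':
    tasks, results = Queue(), Queue()
    workers = [Process(target=worker, args=(tasks, results)) for _ in range(4)]
    for w in workers:
        w.start()

    for item in range(10):   # master dispatches the datasets
        tasks.put(item)
    for _ in workers:        # one stop marker per worker
        tasks.put(None)

    collected = [results.get() for _ in range(10)]
    for w in workers:
        w.join()
    print(sorted(collected))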
I'm trying to understand how Process class from the multiprocessing package works.
For this, I wrote a little example where an object with a certain value is created and then that value is changed in a subprocess:
from multiprocessing import Process

class Foo:
    def __init__(self):
        self.value = "foo"

    def run(self):
        p = Process(target=self.change_value)
        p.start()
        p.join()

    def change_value(self):
        self.value = "bar"
        print "inside: " + self.value

if __name__ == '__main__':
    foo = Foo()
    foo.run()
    print "outside: " + foo.value
But this code gives me the following result:
>> inside: bar
>> outside: foo
Can someone explain to me why it prints the old attribute value ("foo") from outside the process, despite the fact that the second print is executed later?
And how do I get the actual value of that attribute ("bar") from the outside?
This is because multiprocessing.Process spawns a completely new, separate instance of the Python environment in a new process. You will notice in the task manager that a new python.exe process appears as you start the Process. Unless you use the special objects such as Pipe and Queue, it does not share memory with the process it was started from.
A little more about the internal work that is done:
You call p.start(). This pickles the Process object p and spawns a new instance of the Python interpreter with its own global state, etc. It does not share memory with the original process. Instead, the pickled p is unpickled in the new process and the work is done there.
print "inside: " + self.value: This is called by the newly spawned process, so the change is reflected here.
print "outside: " + foo.value: This is called in the original process, which has no idea about the memory of the spawned process and has no access to it. Thus foo is not changed in the original process.
What I guess you intended to use
Most likely the class you are looking for is threading.Thread. It offers the same interface as Process, but it shares the global state and the Python environment with the thread it is started from. Any changes made to objects in a spawned Thread can be read from outside.
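A sketch of that substitution applied to the question's example (same interface, but shared memory, so the mutation is visible):

from threading import Thread

class Foo:
    def __init__(self):
        self.value = "foo"

    def run(self):
        t = Thread(target=self.change_value)
        t.start()
        t.join()

    def change_value(self):
        self.value = "bar"  # same object in the same memory: visible to the caller

if __name__ == '__main__':
    foo = Foo()
    foo.run()
    print("outside: " + foo.value)  # outside: bar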