Python multiprocessing pipe and cloudpickle

I have observed that this code can't run on Windows; it seems the PipeConnection handles can't be copied 1:1, so I assume the multiprocessing library does some kind of extra work when dealing with Process args of type PipeConnection. This is a toy example of the problem:
import multiprocessing, cloudpickle

def _thunk(pipe):
    def f():
        test = pipe[1].recv()
        pipe[1].send(test)
    return f

def _worker(pickled_f):
    f = cloudpickle.loads(pickled_f)
    f()

if __name__ == '__main__':
    pipe = multiprocessing.Pipe()
    pickled_f = cloudpickle.dumps(_thunk(pipe))
    multiprocessing.Process(target=_worker, args=(pickled_f,)).start()
    pipe[0].send('test')
    test = pipe[0].recv()
    print(test)
I want to get around this, but I can't modify the multiprocessing.Process call because it is in a library outside of my code. Other synchronization mechanisms are welcome as long as I can encapsulate the logic inside the thunk, but ideally I want to be able to reconstruct a working Pipe in my new process.
Thanks.
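One direction that might work, sketched here purely as an illustration: instead of capturing the Connection objects in the closure, capture a picklable address and authkey and rebuild the channel inside the child with multiprocessing.connection.Client. The names mirror the toy example but are otherwise hypothetical.

# Hedged sketch: rebuild the channel inside the thunk from a picklable address,
# using multiprocessing.connection.Listener/Client instead of passing a Pipe.
import multiprocessing
from multiprocessing.connection import Listener, Client
import cloudpickle

def _thunk(address, authkey):
    def f():
        conn = Client(address, authkey=authkey)  # connect back to the parent
        conn.send(conn.recv())                   # echo, as in the toy example
        conn.close()
    return f

def _worker(pickled_f):
    cloudpickle.loads(pickled_f)()

if __name__ == '__main__':
    authkey = b'secret'
    listener = Listener(('localhost', 0), authkey=authkey)  # port 0: the OS picks a free port
    pickled_f = cloudpickle.dumps(_thunk(listener.address, authkey))
    multiprocessing.Process(target=_worker, args=(pickled_f,)).start()
    conn = listener.accept()
    conn.send('test')
    print(conn.recv())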

Related

Pathos multiprocessing in class produces garbled standard output

I'm trying to use multiprocessing in a class I have written to speed up calculations. I'm using pathos.multiprocessing and dill, calling map on a ProcessingPool. I've tested the multiprocessing functionality in a console and it performed as expected. The issue I'm having is that when I try to use it in my code, as soon as it calls pool.map the terminal I'm using starts spitting out ridiculous nonsense. The output is recognizable as coming from my code, but I have no idea how it is being printed. Some of it comes from a method like the one defined below, which includes the current datetime; in the nonsense I can see the current time, after pool.map was called, so this isn't just something being repeatedly reprinted, it's new output. Here is a little code illustrating how I'm using multiprocessing.
My_func is a little more complicated than I have below, but as a first step I changed it to literally what is written below, and the problem still persists.
Additionally, Ctrl-C does trigger a KeyboardInterrupt, but does not completely stop the program. I'm using Visual Studio and Python 2.7.13 on Windows 10.
from pathos.multiprocessing import ProcessingPool
import dill
import datetime

class my_class(object):
    def __init__(self):
        pool = ProcessingPool(nodes=4)
        p1 = [1, 2, 3]
        p2 = [4, 5, 6]
        p3 = [7, 8, 9]
        results = pool.map(self.my_func, p1, p2, p3)
    def my_func(self, x, y, z):
        print(x, y, z)
    def status_printout(self, message):
        header = datetime.datetime.now().strftime('%Y/%m/%d %H:%M:%S')
        print(header + ' -- ' + message)
Try using a Lock to ensure only one of the subprocesses writes to stdout at a time.
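A minimal sketch of that suggestion, using the standard library Pool for brevity (the initializer/initargs pattern and the function names here are illustrative, not the asker's pathos code):

# Share one Lock with every worker so only one process prints at a time.
import multiprocessing

def init_worker(lock):
    global print_lock          # stash the lock in each worker process
    print_lock = lock

def my_func(x, y, z):
    with print_lock:           # serialize writes to stdout
        print(x, y, z)

if __name__ == '__main__':
    lock = multiprocessing.Lock()
    with multiprocessing.Pool(4, initializer=init_worker, initargs=(lock,)) as pool:
        pool.starmap(my_func, [(1, 4, 7), (2, 5, 8), (3, 6, 9)])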
I was not using the suggested
if __name__ == '__main__':
    freeze_support()
for Windows. Things are behaving normally now.

How to use multiprocessing.Pool in an imported module?

I have not been able to implement the suggestion here: Applying two functions to two lists simultaneously.
I guess it is because the module is imported by another module, and thus on Windows it spawns multiple Python processes?
My question is: how can I use the code below without the if __name__ == "__main__": guard?
args_m = [(mortality_men, my_agents, graveyard, families, firms, year, agent) for agent in males]
args_f = [(mortality_women, fertility, year, families, my_agents, graveyard, firms, agent) for agent in females]
with mp.Pool(processes=(mp.cpu_count() - 1)) as p:
    p.map_async(process_males, args_m)
    p.map_async(process_females, args_f)
Both process_males and process_females are functions.
args_m, args_f are iterators
Also, I don't need to return anything. Agents are class instances that need updating.
The reason you need to guard multiprocessing code with if __name__ == "__main__" is that you don't want it to run again in the child process. That can happen on Windows, where the interpreter needs to reload all of its state because there is no fork system call to copy the parent process's address space. But you only need the guard around code that is supposed to run at the top level of the main script, and it's not the only way to guard your code.
In your specific case, I think you should put the multiprocessing code in a function. That won't run in the child process as long as nothing else calls the function when it shouldn't. Your main module can import the module and then call the function (from within an if __name__ == "__main__" block, probably).
It should be something like this:
some_module.py:
import multiprocessing as mp

def process_males(x):
    ...

def process_females(x):
    ...

args_m = [...]  # these could be defined inside the function below if that makes more sense
args_f = [...]

def do_stuff():
    with mp.Pool(processes=(mp.cpu_count() - 1)) as p:
        p.map_async(process_males, args_m)
        p.map_async(process_females, args_f)
main.py:
import some_module

if __name__ == "__main__":
    some_module.do_stuff()
In your real code you might want to pass some arguments or get a return value from do_stuff (which should also be given a more descriptive name than the generic one I've used in this example).
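For instance, a minimal sketch of do_stuff taking an argument and returning results (the worker body and names here are placeholders, not the asker's real code):

import multiprocessing as mp

def process_males(args):
    return len(args)  # placeholder work on one argument tuple

def do_stuff(males):
    args_m = [(1, agent) for agent in males]            # build the argument tuples
    with mp.Pool(processes=max(mp.cpu_count() - 1, 1)) as p:
        result = p.map_async(process_males, args_m)
        return result.get()                              # wait for and return the results

if __name__ == "__main__":
    print(do_stuff(range(10)))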
The idea of if __name__ == '__main__': is to avoid infinite process spawning.
When pickling a function defined in your main script, Python has to figure out which part of your main script is the function's code, so it basically re-runs your script in the child. If the code creating the Pool is in the same script and not protected by the "if main" guard, then importing the function will try to launch another Pool, which will try to launch another Pool, and so on.
Thus you should separate the function definitions from the actual main script:
from multiprocessing import Pool

# define test functions outside main
# so they can be imported without launching
# a new Pool
def test_func():
    pass

if __name__ == '__main__':
    with Pool(4) as p:
        r = p.apply_async(test_func)
        # ... do stuff
        result = r.get()
Cannot yet comment on the question, but a workaround I have used that some have mentioned is just to define the process_males etc. functions in a module that is different to where the processes are spawned. Then import the module containing the multiprocessing spawns.
I solved it by calling the module's multiprocessing function within the if __name__ == "__main__": block of the main script, since the function that involves multiprocessing is the last step in my module. Others could try this if applicable.

multiprocessing launch from within module or class, not from main()

I want to use Python's multiprocessing module to make effective use of multiple CPUs to speed up my processing.
All seems to work; however, I want to run Pool.map(f, [item, item]) from within a class, in a submodule somewhere deep in my program. The reason is that the program has to prepare the data first and wait for certain events to happen before there is anything to process.
The multiprocessing docs say you can only run it from within an if __name__ == '__main__': block. I don't understand the significance of that and tried it anyway, like so:
from multiprocessing import Pool

class Foo(object):
    n = 1000000
    def __init__(self, x):
        self.x = x + 1
        pass
    def run(self):
        for i in range(1, self.n):
            self.x *= 1.0 * i / self.x
        return self

class Bar(object):
    def __init__(self):
        pass
    def go_all(self):
        work = [Foo(i) for i in range(960)]
        def do(obj):
            return obj.run()
        p = Pool(16)
        finished_work = p.map(do, work)
        return

bar = Bar()
bar.go_all()
It indeed doesn't work! I get the following error:
PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
I don't quite understand why, as everything seems to be perfectly picklable. I have the following questions:
Can this be made to work without putting the p.map line in my main program?
If not, can "main" programs be called as sub-routines/modules, so that it still works?
Is there some handy trick to loop back from a submodule to the main program and run it from there?
I'm on Linux and Python 2.7
I believe you misunderstood the documentation. What the documentation says is to do this:
if __name__ == '__main__':
    bar = Bar()
    bar.go_all()
So your p.map line does not need to be inside your "main function", or whatever. Only the code that actually spawns the subprocesses has to be "guarded". This is unavoidable due to limitations of the Windows OS.
Moreover, the function that you pass to Pool.map has to be importable (functions are pickled simply by their names; the interpreter then has to be able to import them to rebuild the function object when they are passed to the subprocess). So you should probably move your do function to module level to avoid pickling errors.
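A minimal sketch of the question's code rearranged that way, with do at module level and the process-spawning code under the guard:

from multiprocessing import Pool

class Foo(object):
    n = 1000000
    def __init__(self, x):
        self.x = x + 1
    def run(self):
        for i in range(1, self.n):
            self.x *= 1.0 * i / self.x
        return self

def do(obj):           # module level, so it can be pickled by name and re-imported
    return obj.run()

class Bar(object):
    def go_all(self):
        work = [Foo(i) for i in range(960)]
        p = Pool(16)
        return p.map(do, work)

if __name__ == '__main__':
    finished_work = Bar().go_all()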
The extra restrictions on the multiprocessing module on ms-windows stem from the fact that it doesn't have the fork system call. On UNIX-like operating systems, fork makes a perfect copy of a process and continues to run it next to the parent process. The only difference between them is that fork returns a different value in the parent and child processes.
On ms-windows, multiprocessing needs to start a new Python instance using a native method to start processes. Then it needs to bring that Python instance into the same state as the "parent" process.
This means (among other things) that the Python code must be importable without side effects like trying to start yet another process. Hence the use of the if __name__ == '__main__' guard.
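To illustrate the fork behaviour described above (POSIX only; os.fork is the underlying system call, and the print text is just for demonstration):

import os

pid = os.fork()                 # not available on ms-windows
if pid == 0:
    print('child: fork returned 0')
else:
    print('parent: fork returned the child pid', pid)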

can't multiprocess user defined code - cannot pickle

I tried using multiprocessing for my task, which generally means doing some calculation and then passing back the result. The problem is that the code defining the calculation is defined by the user; it is compiled from a string before execution. This works perfectly with exec(), eval(), compile(), etc. when run in the main process. The example below works only for the f1 function but not for f2: I get "Can't pickle <class 'code'>". Is there any way around this? For example, using multiprocessing differently, another package, or some more low-level mechanism? Unfortunately, passing the string to the process and then compiling inside the process is not an option for me because of the design of the whole application (i.e. the code string is 'lost' and only the compiled version is available).
import multiprocessing

def callf(f, a):
    exec(f, {'a': a})

if __name__ == "__main__":
    f = compile("print(a)", filename="<string>", mode="exec")
    callf(f, 10)  # this works
    process = multiprocessing.Process(target=callf, args=(f, 20))  # this does not work
    process.start()
    process.join()
UPDATE: here is another attempt, which is actually closer to my actual need. It results in a different error message, but the function also cannot be pickled.
import multiprocessing

if __name__ == "__main__":
    source = "def f(): print('done')"
    locals = dict()
    exec(source, {}, locals)
    f = locals['f']
    f()  # this works
    process = multiprocessing.Process(target=f)  # this does not work
    process.start()
    process.join()
pickle can't serialize code objects but dill can. There is a dill-based fork of multiprocessing called multiprocessing_on_dill but I have no idea how good it is. You could also just dill-encode the code object to make standard multiprocessing happy.
import multiprocessing
import dill

def callf_dilled(f_dilled, a):
    return callf(dill.loads(f_dilled), a)

def callf(f, a):
    exec(f, {'a': a})

if __name__ == "__main__":
    f = compile("print(a)", filename="<string>", mode="exec")
    callf(f, 10)  # this works
    process = multiprocessing.Process(target=callf_dilled,
                                      args=(dill.dumps(f), 20))  # now this works too!
    process.start()
    process.join()

Is it possible to use multiprocessing in a module with windows?

I'm currently going through some pre-existing code with the goal of speeding it up. There are a few places that are extremely good candidates for parallelization. Since Python has the GIL, I thought I'd use the multiprocessing module.
However, from my understanding the only way this will work on Windows is if I call the function that needs multiple processes from the highest-level script, guarded by if __name__ == '__main__'. However, this particular program was meant to be distributed and imported as a module, so it would be clunky to have the user copy and paste that safeguard, and that is something I'd really like to avoid.
Am I out of luck or misunderstanding something as far as multiprocessing goes? Or is there any other way to do it with Windows?
For everyone still searching:
inside the module (data.py):
from multiprocessing import Process

def printing(a):
    print(a)

def foo(name):
    var = {"process": {}}
    if name == "__main__":
        for i in range(10):
            var["process"][i] = Process(target=printing, args=(str(i),))
            var["process"][i].start()
        for i in range(10):
            var["process"][i].join()
inside main.py
import data
name = __name__
data.foo(name)
output:
>>2
>>6
>>0
>>4
>>8
>>3
>>1
>>9
>>5
>>7
I am a complete noob so please don't judge the coding OR presentation but at least it works.
As explained in comments, perhaps you could do something like
# client_main.py
from mylib.mpSentinel import MPSentinel

# client logic
if __name__ == "__main__":
    MPSentinel.As_master()

# mpsentinel.py
class MPSentinel(object):
    _is_master = False

    @classmethod
    def As_master(cls):
        cls._is_master = True

    @classmethod
    def Is_master(cls):
        return cls._is_master
It's not ideal in that it's effectively a singleton/global, but it would work around Windows' lack of fork. Still, you could use MPSentinel.Is_master() to use multiprocessing optionally, and it should prevent Windows from process bombing.
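A hedged sketch of how library code might then branch on the sentinel (work and run are hypothetical names, not part of the answer above):

from multiprocessing import Pool
from mylib.mpSentinel import MPSentinel

def work(x):
    return x * x

def run(items):
    if MPSentinel.Is_master():          # safe to spawn: the guarded main ran
        with Pool() as p:
            return p.map(work, items)
    return [work(x) for x in items]     # otherwise fall back to a plain loop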
On ms-windows, you should be able to import the main module of a program without side effects like starting a process.
When Python imports a module, it actually runs it.
So one way of achieving that is to put the process-starting code in the if __name__ == '__main__' block.
Another way is to do it from within a function.
The following won't work on ms-windows:
from multiprocessing import Process

def foo():
    print('hello')

p = Process(target=foo)
p.start()
This is because it tries to start a process when importing the module.
The following example from the programming guidelines is OK:
from multiprocessing import Process, freeze_support, set_start_method

def foo():
    print('hello')

if __name__ == '__main__':
    freeze_support()
    set_start_method('spawn')
    p = Process(target=foo)
    p.start()
Because the code in the if block doesn't run when the module is imported.
But putting it in a function should also work:
from multiprocessing import Process

def foo():
    print('hello')

def bar():
    p = Process(target=foo)
    p.start()
When this module is run or imported, it will define the two functions, not run them.
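So the calling script only needs something like this (the module name is a placeholder for wherever foo and bar above are defined):

# main.py
import mymodule          # the module containing foo() and bar() above

if __name__ == '__main__':
    mymodule.bar()       # processes are only started from the guarded block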
I've been developing an Instagram image scraper, so in order to make the download & save operations run faster I implemented multiprocessing in an auxiliary module. Note that this code is inside an auxiliary module and not inside the main module.
The solution I found is adding this line:
if __name__ != '__main__':
Pretty simple, but it's actually working!
import multiprocessing
import requests

def multi_proces(urls, profile):
    img_saved = 0
    if __name__ != '__main__':  # line needed for the sake of getting this NOT to crash
        processes = []
        for url in urls:
            try:
                process = multiprocessing.Process(target=download_save, args=[url, profile, img_saved])
                processes.append(process)
                img_saved += 1
            except:
                continue
        for proce in processes:
            proce.start()
        for proce in processes:
            proce.join()
    return img_saved

def download_save(url, profile, img_saved):
    file = requests.get(url, allow_redirects=True)  # Download
    open(f"scraped_data\\{profile}\\{profile}-{img_saved}.jpg", 'wb').write(file.content)  # Save
