can't multiprocess user defined code - cannot pickle - python

I tried using multiprocessing for my task, which generally means doing some calculation and then passing back the result. The problem is that the code defining the calculation is supplied by the user and compiled from a string before execution. This works perfectly with exec(), eval(), compile() etc. when run in the main process. In the example below, the direct call works but the call through multiprocessing.Process does not; I get "Can't pickle class 'code'". Is there any way around this? For example, using multiprocessing differently, using another package, or something more low-level? Unfortunately, passing the string to the process and compiling it there is not an option for me because of the design of the whole application (i.e. the code string is 'lost' and only the compiled version is available).
import multiprocessing
def callf(f, a):
    exec(f, {'a': a})
if __name__ == "__main__":
    f = compile("print(a)", filename="<string>", mode="exec")
    callf(f, 10) # this works
    process = multiprocessing.Process(target=callf, args=(f, 20)) # this does not work
    process.start()
    process.join()
UPDATE: Here is another attempt, which is closer to my actual need. It results in a different error message, but the function still cannot be pickled.
import multiprocessing
if __name__ == "__main__":
    source = "def f(): print('done')"
    locals = dict()
    exec(source, {}, locals)
    f = locals['f']
    f() # this works
    process = multiprocessing.Process(target=f) # this does not work
    process.start()
    process.join()

pickle can't serialize code objects but dill can. There is a dill-based fork of multiprocessing called multiprocessing_on_dill but I have no idea how good it is. You could also just dill-encode the code object to make standard multiprocessing happy.
import multiprocessing
import dill
def callf_dilled(f_dilled, a):
    return callf(dill.loads(f_dilled), a)
def callf(f, a):
    exec(f, {'a': a})
if __name__ == "__main__":
    f = compile("print(a)", filename="<string>", mode="exec")
    callf(f, 10) # this works
    process = multiprocessing.Process(target=callf_dilled,
                                      args=(dill.dumps(f), 20)) # now this works too!
    process.start()
    process.join()
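If adding a dependency is not possible, the standard library's marshal module can also turn a code object into bytes. Below is a minimal sketch, assuming parent and child run the same Python version (the marshal format is not guaranteed to be stable across versions). The names run_code and run_in_child, the result variable and the Queue are illustrative choices, not part of the original code.

```python
import marshal
import multiprocessing


def run_code(code_bytes, a, queue):
    """Child side: rebuild the code object from bytes and execute it."""
    code = marshal.loads(code_bytes)
    scope = {"a": a}
    exec(code, scope)
    # Assumption of this sketch: the user code stores its answer in `result`.
    queue.put(scope["result"])


def run_in_child(code, a):
    """Parent side: send the code object to the child as marshal bytes."""
    queue = multiprocessing.Queue()
    process = multiprocessing.Process(
        target=run_code, args=(marshal.dumps(code), a, queue)
    )
    process.start()
    value = queue.get()  # read before join() so the child can flush the queue
    process.join()
    return value


if __name__ == "__main__":
    code = compile("result = a * 2", filename="<string>", mode="exec")
    print(run_in_child(code, 20))
```

For the function in the UPDATE, the same idea should carry over by sending f.__code__ and rebuilding it in the child with types.FunctionType, as long as the function has no closure.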

Related

Python multiprocessing with macOs

I have a Mac (macOS 10.15.4, Python ver 3.82) and need to use multiprocessing, but it doesn't work on my machine.
For example, I copied this simple parallel Python program:
import multiprocessing as mp
import time
def test_function(i):
    print("function starts" + str(i))
    time.sleep(1)
    print("function ends" + str(i))
if __name__ == '__main__':
    pool = mp.Pool(mp.cpu_count())
    pool.map(test_function, [i for i in range(4)])
    pool.close()
    pool.join()
What I expect to see in the output:
function starts0
function ends0
function starts1
function ends1
function starts2
function ends2
function starts3
function ends3
Or similar...
What I actually see:
= RESTART: /Users/Simulazioni/prova.py
>>>
Just nothing: no errors and no information. I have already tried many approaches without results. The main problem, as far as I can see, is the function call; the line
if __name__ == '__main__':
doesn't seem to call the function
def test_function(i):
I tried many examples of that kind without results.
Is it possible to parallelize on macOS, and what is the easiest way?
I know this is a bit of an old question, but I just faced the same issue and solved it by using a fork of the multiprocessing package, which is:
import multiprocess
instead of:
import multiprocessing
Reference:
https://pypi.org/project/multiprocess/

Multiprocessing & Pool in __main__ - how to get the output outside the __main__?

Based on this answer (https://stackoverflow.com/a/20192251/9024698), I have to do this:
from multiprocessing import Pool
def process_image(name):
    sci = fits.open('{}.fits'.format(name))
    <process>
if __name__ == '__main__':
    pool = Pool() # Create a multiprocessing Pool
    pool.map(process_image, data_inputs) # process data_inputs iterable with pool
to multi-process a for loop.
However, I am wondering, how can I get the output of this and further process if I want?
It must be like that:
if __name__ == '__main__':
    pool = Pool() # Create a multiprocessing Pool
    output = pool.map(process_image, data_inputs) # process data_inputs iterable with pool
    # further processing
But then this means that I have to put all the rest of my code in __main__ unless I write everything in functions which are called by __main__?
The notion of __main__ has always been pretty confusing to me.
if __name__ == '__main__': literally just means "if this file is being run as a script, as opposed to being imported as a module, then do this". __name__ is a hidden variable that is set to '__main__' when the file is run as a script. Why it works this way is beyond the scope of this discussion, but suffice it to say it has to do with how Python evaluates source files top to bottom.
In other words, you can put the other two lines anywhere you want, most likely in a function that you call elsewhere in the program. You could return output from that function, do other processing on it, or whatever else you happen to need.
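As a rough sketch of that layout: the pool lives in a function that returns its output, further processing lives in another function, and only the entry point sits under the guard. The body of process_image is a stand-in for the FITS processing in the question, and process_all and summarize are hypothetical names.

```python
from multiprocessing import Pool


def process_image(name):
    # Stand-in for the real per-image work from the question.
    return name.upper()


def process_all(data_inputs):
    # The pool is created inside a function and the results are returned.
    with Pool() as pool:
        return pool.map(process_image, data_inputs)


def summarize(output):
    # Further processing can live in ordinary functions as well.
    return ", ".join(output)


if __name__ == "__main__":
    output = process_all(["a", "b", "c"])
    print(summarize(output))
```

Only the two lines that start the work need to be under the guard; everything else stays importable.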

Python multiprocessing pipe and cloudpickle

I have observed that this code can't run on Windows. It seems that PipeConnection handles can't be copied 1:1, so I assume the multiprocessing library does some extra work when a Process argument is a PipeConnection. This is a toy example of the problem:
import multiprocessing, cloudpickle
def _thunk(pipe):
    def f():
        test = pipe[1].recv()
        pipe[1].send(test)
    return f
def _worker(pickled_f):
    f = cloudpickle.loads(pickled_f)
    f()
if __name__ == '__main__':
    pipe = multiprocessing.Pipe()
    pickled_f = cloudpickle.dumps(_thunk(pipe))
    multiprocessing.Process(target=_worker, args=(pickled_f,)).start()
    pipe[0].send('test')
    test = pipe[0].recv()
    print(test)
I want to get around this but I can't modify the multiprocessing.Process call, because it is in a lib outside of my code. Other synchronization mechanisms are welcome as long as I can encapsulate the logic inside the thunk. But ideally I want to be able to reconstruct a working Pipe in my new process.
Thanks.

How to use multiprocessing.Pool in an imported module?

I have not been able to implement the suggestion here: Applying two functions to two lists simultaneously.
I guess it is because the module is imported by another module and thus my Windows spawns multiple python processes?
My question is: how can I use the code below without the if __name__ == "__main__": guard?
args_m = [(mortality_men, my_agents, graveyard, families, firms, year, agent) for agent in males]
args_f = [(mortality_women, fertility, year, families, my_agents, graveyard, firms, agent) for agent in females]
with mp.Pool(processes=(mp.cpu_count() - 1)) as p:
    p.map_async(process_males, args_m)
    p.map_async(process_females, args_f)
Both process_males and process_females are functions.
args_m and args_f are lists.
Also, I don't need to return anything. Agents are class instances that need updating.
The reason you need to guard multiprocessing code in an if __name__ == "__main__" block is that you don't want it to run again in the child process. That can happen on Windows, where the interpreter needs to reload all of its state since there's no fork system call that will copy the parent process's address space. But you only need the guard where code would otherwise run at the top level of the main script. It's not the only way to guard your code.
In your specific case, I think you should put the multiprocessing code in a function. That won't run in the child process, as long as nothing else calls the function when it should not. Your main module can import the module, then call the function (from within an if __name__ == "__main__" block, probably).
It should be something like this:
some_module.py:
import multiprocessing as mp
def process_males(x):
    ...
def process_females(x):
    ...
args_m = [...] # these could be defined inside the function below if that makes more sense
args_f = [...]
def do_stuff():
    with mp.Pool(processes=(mp.cpu_count() - 1)) as p:
        males = p.map_async(process_males, args_m)
        females = p.map_async(process_females, args_f)
        # wait for both batches: leaving the with block terminates the pool
        males.wait()
        females.wait()
main.py:
import some_module
if __name__ == "__main__":
    some_module.do_stuff()
In your real code you might want to pass some arguments or get a return value from do_stuff (which should also be given a more descriptive name than the generic one I've used in this example).
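One caveat for the "agents need updating" part: each worker operates on a pickled copy of its arguments, so changes made inside process_males or process_females do not reach the objects in the parent process. Below is a minimal sketch of handing the updated copies back. Agent, grow_older and update_agents are hypothetical stand-ins for the question's classes and functions.

```python
import multiprocessing as mp


class Agent:
    """Hypothetical stand-in for the question's agent class."""

    def __init__(self, name, age):
        self.name = name
        self.age = age


def grow_older(agent):
    # Runs in a worker on a pickled copy; mutating it here does not
    # touch the parent's object, so hand the updated copy back.
    agent.age += 1
    return agent


def update_agents(agents):
    with mp.Pool(processes=2) as pool:
        # map() blocks until every result is back, so the pool is not
        # terminated early when the with block ends.
        return pool.map(grow_older, agents)


if __name__ == "__main__":
    agents = [Agent("a", 30), Agent("b", 41)]
    updated = update_agents(agents)
    print([agent.age for agent in agents])   # the parent's originals
    print([agent.age for agent in updated])  # the copies sent back
```

The parent then replaces its own instances with the returned ones, or merges the changed fields back in.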
The idea of if __name__ == '__main__': is to avoid infinite process spawning.
When a child process needs a function defined in your main script, Python has to load that script again to find it (this is how process creation works on Windows), which basically re-runs your script. If the code creating the Pool is in the same script and not protected by the "if main" guard, then importing the function launches another Pool, which tries to launch another Pool, and so on.
Thus you should separate the function definitions from the actual main script:
from multiprocessing import Pool
# define test functions outside main
# so they can be imported without launching
# a new Pool
def test_func():
    pass
if __name__ == '__main__':
    with Pool(4) as p:
        r = p.apply_async(test_func)
        # ... do stuff
        result = r.get()
I cannot comment on the question yet, but a workaround I have used, and that some have mentioned, is to define the process_males etc. functions in a different module from the one where the processes are spawned. Then import the module containing the multiprocessing spawns.
I solved it by calling the module's multiprocessing function within if __name__ == "__main__": of the main script, since the function that involves multiprocessing is the last step in my module. Others could try this if applicable.

Is it possible to use multiprocessing in a module with windows?

I'm currently going through some pre-existing code with the goal of speeding it up. There's a few places that are extremely good candidates for parallelization. Since Python has the GIL, I thought I'd use the multiprocess module.
However, from my understanding, the only way this will work on Windows is if I call the function that needs multiple processes from the highest-level script with the if __name__=='__main__' safeguard. This particular program is meant to be distributed and imported as a module, so it would be clunky to have the user copy and paste that safeguard, and I'd really like to avoid that.
Am I out of luck or misunderstanding something as far as multiprocessing goes? Or is there any other way to do it with Windows?
For everyone still searching:
inside module (data.py)
from multiprocessing import Process
def printing(a):
    print(a)
def foo(name):
    var = {"process": {}}
    if name == "__main__":
        for i in range(10):
            # args must be a tuple, hence the trailing comma
            var["process"][i] = Process(target=printing, args=(str(i),))
            var["process"][i].start()
        for i in range(10):
            var["process"][i].join()
inside main.py
import data
name = __name__
data.foo(name)
output:
>>2
>>6
>>0
>>4
>>8
>>3
>>1
>>9
>>5
>>7
I am a complete noob so please don't judge the coding OR presentation but at least it works.
As explained in the comments, perhaps you could do something like
#client_main.py
from mylib.mpSentinel import MPSentinel
#client logic
if __name__ == "__main__":
    MPSentinel.As_master()
#mpsentinel.py
class MPSentinel(object):
    _is_master = False
    @classmethod
    def As_master(cls):
        cls._is_master = True
    @classmethod
    def Is_master(cls):
        return cls._is_master
It's not ideal in that it's effectively a singleton/global, but it would work around Windows' lack of fork. You could still use MPSentinel.Is_master() to make multiprocessing optional, and it should prevent Windows from process bombing.
On ms-windows, you should be able to import the main module of a program without side effects like starting a process.
When Python imports a module, it actually runs it.
So one way of avoiding such side effects is to start the process inside the if __name__ == '__main__' block.
Another way is to do it from within a function.
The following won't work on ms-windows:
from multiprocessing import Process
def foo():
    print('hello')
p = Process(target=foo)
p.start()
This is because it tries to start a process when importing the module.
The following example from the programming guidelines is OK:
from multiprocessing import Process, freeze_support, set_start_method
def foo():
    print('hello')
if __name__ == '__main__':
    freeze_support()
    set_start_method('spawn')
    p = Process(target=foo)
    p.start()
Because the code in the if block doesn't run when the module is imported.
But putting it in a function should also work:
from multiprocessing import Process
def foo():
    print('hello')
def bar():
    p = Process(target=foo)
    p.start()
When this module is run, it will define two new functions, not run them.
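The claim that importing a module runs its top-level code, but neither its function bodies nor its `__main__` block, can be checked with a small experiment. The sketch below writes a throwaway module to a temporary directory and imports it; SOURCE, load_module and demo_module are made-up names for this illustration.

```python
import importlib.util
import pathlib
import tempfile

# Source of a throwaway module that records which parts of it execute.
SOURCE = """
events = ["top level ran as " + __name__]

def bar():
    events.append("bar ran")

if __name__ == "__main__":
    events.append("main block ran")
"""


def load_module(source, name="demo_module"):
    """Write `source` to a temporary file and import it under `name`."""
    with tempfile.TemporaryDirectory() as folder:
        path = pathlib.Path(folder) / (name + ".py")
        path.write_text(source)
        spec = importlib.util.spec_from_file_location(name, str(path))
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)  # this is the "import runs it" step
        return module


if __name__ == "__main__":
    module = load_module(SOURCE)
    print(module.events)  # only the top-level line has run so far
    module.bar()
    print(module.events)  # bar's body runs only when it is called
```

Inside the imported module `__name__` is the module's own name, which is why code behind the guard, or inside a function nobody calls, stays dormant during import.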
I've been developing an Instagram image scraper, and to make the download and save operations run faster I implemented multiprocessing in an auxiliary module. Note that this code is inside an auxiliary module and not inside the main module.
The solution I found is adding this line:
if __name__ != '__main__':
Pretty simple, but it actually works!
import multiprocessing
import requests
def multi_proces(urls, profile):
    img_saved = 0
    if __name__ != '__main__': # line needed for the sake of getting this NOT to crash
        processes = []
        for url in urls:
            try:
                process = multiprocessing.Process(target=download_save, args=[url, profile, img_saved])
                processes.append(process)
                img_saved += 1
            except Exception:
                continue
        for proce in processes:
            proce.start()
        for proce in processes:
            proce.join()
    return img_saved
def download_save(url, profile, img_saved):
    file = requests.get(url, allow_redirects=True) # Download
    # backslashes are doubled so they are not read as escape sequences
    with open(f"scraped_data\\{profile}\\{profile}-{img_saved}.jpg", 'wb') as out: # Save
        out.write(file.content)
