Is it possible to use multiprocessing in a module with Windows? - python

I'm currently going through some pre-existing code with the goal of speeding it up. There are a few places that are extremely good candidates for parallelization. Since Python has the GIL, I thought I'd use the multiprocessing module.
However, from my understanding, the only way this will work on Windows is if I call the function that needs multiple processes from the highest-level script with the if __name__ == '__main__' safeguard. However, this particular program was meant to be distributed and imported as a module, so it'd be kind of clunky to have the user copy and paste that safeguard, and it is something I'd really like to avoid doing.
Am I out of luck or misunderstanding something as far as multiprocessing goes? Or is there any other way to do it with Windows?

For everyone still searching:
inside module (data.py)
from multiprocessing import Process
def printing(a):
    print(a)
def foo(name):
    var = {"process": {}}
    if name == "__main__":
        for i in range(10):
            # args must be a tuple, hence the trailing comma
            var["process"][i] = Process(target=printing, args=(str(i),))
            var["process"][i].start()
        for i in range(10):
            var["process"][i].join()
inside main.py
import data
name = __name__
data.foo(name)
output:
>>2
>>6
>>0
>>4
>>8
>>3
>>1
>>9
>>5
>>7
I am a complete noob so please don't judge the coding OR presentation but at least it works.

As explained in comments, perhaps you could do something like
#client_main.py
from mylib.mpsentinel import MPSentinel
#client logic
if __name__ == "__main__":
    MPSentinel.As_master()
#mpsentinel.py
class MPSentinel(object):
    _is_master = False
    @classmethod
    def As_master(cls):
        cls._is_master = True
    @classmethod
    def Is_master(cls):
        return cls._is_master
It's not ideal in that it's effectively a singleton/global, but it would work around Windows' lack of fork. You could then check MPSentinel.Is_master() to use multiprocessing optionally, and it should prevent Windows from process bombing.

On ms-windows, you should be able to import the main module of a program without side effects like starting a process.
When Python imports a module, it actually runs it.
So one way of starting a process safely is inside the if __name__ == '__main__' block.
Another way is to do it from within a function.
The following won't work on ms-windows:
from multiprocessing import Process
def foo():
    print('hello')
p = Process(target=foo)
p.start()
This is because it tries to start a process when importing the module.
The following example from the programming guidelines is OK:
from multiprocessing import Process, freeze_support, set_start_method
def foo():
    print('hello')
if __name__ == '__main__':
    freeze_support()
    set_start_method('spawn')
    p = Process(target=foo)
    p.start()
Because the code in the if block doesn't run when the module is imported.
But putting it in a function should also work:
from multiprocessing import Process
def foo():
    print('hello')
def bar():
    p = Process(target=foo)
    p.start()
When this module is run, it will define two new functions, not run them.
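To see the difference between importing and running for yourself, here is a minimal sketch. It writes a throwaway module to a temporary directory and imports it, so you can check which statements execute at import time. The names demo_module, EVENTS and import_from_source are made up for this illustration, and no processes are started.

```python
import importlib.util
import tempfile
import textwrap
from pathlib import Path

# Source of a hypothetical module with three kinds of code in it.
MODULE_SOURCE = textwrap.dedent('''
    EVENTS = ["top-level statement ran"]

    def bar():
        EVENTS.append("bar() was called")

    if __name__ == "__main__":
        EVENTS.append("guarded block ran")
''')

def import_from_source(source, name="demo_module"):
    """Write the source to a temporary file and import it under the given name."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / f"{name}.py"
        path.write_text(source)
        spec = importlib.util.spec_from_file_location(name, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)  # this is the "import runs the module" step
        return module

demo = import_from_source(MODULE_SOURCE)
events_after_import = list(demo.EVENTS)
print(events_after_import)   # only the top-level statement has run

demo.bar()                   # function bodies run only when called
print(demo.EVENTS)
```

Anything that creates a Process belongs in the second or third category (a function body or the guarded block), never in the first.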

I've been developing an Instagram image scraper, and to make the download and save operations run faster I've implemented multiprocessing in an auxiliary module. Note that this code is inside an auxiliary module and not inside the main module.
The solution I found is adding this line:
if __name__ != '__main__':
Pretty simple, but it's actually working!
import os
import multiprocessing
import requests
def multi_proces(urls, profile):
    img_saved = 0
    if __name__ != '__main__':  # line needed for the sake of getting this NOT to crash
        processes = []
        for url in urls:
            try:
                process = multiprocessing.Process(target=download_save, args=[url, profile, img_saved])
                processes.append(process)
                img_saved += 1
            except Exception:
                continue
        for proce in processes:
            proce.start()
        for proce in processes:
            proce.join()
    return img_saved
def download_save(url, profile, img_saved):
    file = requests.get(url, allow_redirects=True)  # Download
    path = os.path.join("scraped_data", profile, f"{profile}-{img_saved}.jpg")
    with open(path, 'wb') as out:  # Save
        out.write(file.content)

Related

I am having problems with ProcessPoolExecutor from concurrent.futures

I have a large program that takes a while to run its calculations. I decided to learn about multithreading and multiprocessing because only 20% of my processor was being used. After seeing no improvement with multithreading, I decided to try multiprocessing, but whenever I try to use it, it just shows a lot of errors, even on very simple code.
This is the code that I tested after I started having problems with my big calculation-heavy code:
from concurrent.futures import ProcessPoolExecutor
def func():
    print("done")
def func_():
    print("done")
def main():
    executor = ProcessPoolExecutor(max_workers=3)
    p1 = executor.submit(func)
    p2 = executor.submit(func_)
main()
and the error message that I am getting says
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
    if __name__ == '__main__':
        freeze_support()
        ...
This is not the whole message because it is very long, but I think it may be enough to help. Pretty much everything else in the error message is just lines like "error at line ... in ...".
If it helps, the full code is at: https://github.com/nobody48sheldor/fuseeinator2.0
It might not be the latest version.
I updated your code to show main being called. This is an issue with spawning operating systems like Windows. To test on my linux machine I had to add a bit of code. But this crashes on my machine:
# Test code to make Linux spawn like Windows and generate the error. This code
# is not needed on Windows.
if __name__ == "__main__":
    import multiprocessing as mp
    mp.freeze_support()
    mp.set_start_method('spawn')
# test script
from concurrent.futures import ProcessPoolExecutor
def func():
    print("done")
def func_():
    print("done")
def main():
    executor = ProcessPoolExecutor(max_workers=3)
    p1 = executor.submit(func)
    p2 = executor.submit(func_)
main()
In a spawning system, Python can't just fork into a new execution context. Instead, it runs a new instance of the Python interpreter, imports the module and pickles/unpickles enough state to make a child execution environment. This can be a very heavy operation.
But your script is not import safe. Since main() is called at module level, the import in the child would run main again. That would create a grandchild subprocess which runs main again (and so on, until you hang your machine). Python detects this infinite loop and displays the message instead.
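The re-import can be observed directly. The sketch below is only an illustration: it writes a small guarded script to a temporary file, runs it in a separate interpreter with the spawn start method forced, and asks the workers which name the script was loaded under. The names SCRIPT, module_name and run_demo are hypothetical.

```python
import subprocess
import sys
import tempfile
import textwrap
from pathlib import Path

# A hypothetical top-level script. The guard keeps pool creation out of the children.
SCRIPT = textwrap.dedent('''
    import multiprocessing as mp

    def module_name(_):
        # Runs in a worker: report the name this script was imported under there.
        return __name__

    if __name__ == "__main__":
        mp.set_start_method("spawn")  # behave like Windows on any OS
        with mp.Pool(2) as pool:
            print("parent:", __name__)
            print("workers:", sorted(set(pool.map(module_name, range(4)))))
''')

def run_demo():
    """Run the script as a real file so the spawned children can re-import it."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "demo_script.py"
        script.write_text(SCRIPT)
        done = subprocess.run([sys.executable, str(script)],
                              capture_output=True, text=True, timeout=120)
        return done.stdout

output = run_demo()
print(output)
```

The parent reports __main__, while the workers should report __mp_main__: the same file has been executed again in each child, just under a different name, which is why only the guarded block is skipped there.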
Top level scripts are always called "__main__". Put all of the code that should only be run once at the script level inside an if. If the module is imported, nothing harmful is run.
if __name__ == "__main__":
    main()
and the script will work.
There are code analyzers out there that import modules to extract doc strings, or other useful stuff. Your code shouldn't fire the missiles just because some tool did an import.
Another way to organize this is to move everything multiprocessing related out of the script and into a module. Suppose I had a module with your code in it
whatever.py
from concurrent.futures import ProcessPoolExecutor
def func():
    print("done")
def func_():
    print("done")
def main():
    executor = ProcessPoolExecutor(max_workers=3)
    p1 = executor.submit(func)
    p2 = executor.submit(func_)
myscript.py
#!/usr/bin/env python3
import whatever
if __name__ == "__main__":
    whatever.main()
Now the pool lives in an imported module that has no import-time side effects. Note that with the spawn start method the child still re-imports the top-level script, so the call in myscript.py should stay inside the if __name__ == "__main__": guard. An unguarded whatever.main() at module level would be re-run in every child.

Python multiprocessing with macOs

I have a Mac (macOS 10.15.4, Python ver 3.82) and need to work with multiprocessing, but on my machine the procedures don't work.
For example, I have copied a simple parallel Python program:
import multiprocessing as mp
import time
def test_function(i):
    print("function starts" + str(i))
    time.sleep(1)
    print("function ends" + str(i))
if __name__ == '__main__':
    pool = mp.Pool(mp.cpu_count())
    pool.map(test_function, [i for i in range(4)])
    pool.close()
    pool.join()
What I expect to see in the output:
function starts0
function ends0
function starts1
function ends1
function starts2
function ends2
function starts3
function ends3
Or similar...
What I actually see:
= RESTART: /Users/Simulazioni/prova.py
>>>
Just nothing: no errors and no information. I have already tried many procedures without results. The main problem, as far as I can see, is the function call; in fact the instruction:
if __name__ == '__main__':
doesn’t call the function,
def test_function(i):
I tried many examples of that kind without results.
Is it possible, and what is the easiest way, to parallelize on macOS?
I know this is a bit of an old question, but I just faced the same issue and I solved it by using a different multiprocessing package, which is:
import multiprocess
instead of:
import multiprocessing
Reference:
https://pypi.org/project/multiprocess/

Use python's multiprocessing library in Rust

I use Rust to speed up a data processing pipeline, but I have to run some existing Python code as-is, which I want to parallelize. Following discussion in another question, creating multiple Python processes is a possible approach given my project's specific constraints. However, running the code below gives an infinite loop. I can't quite understand why.
use cpython::Python;
fn main() {
    let gil = Python::acquire_gil();
    let py = gil.python();
    py.run(r#"
import sys
from multiprocessing import Process
def f(name):
    print('hello', name)
if __name__ == '__main__':
    print('start')
    sys.argv=['']
    p = Process(target=f, args=('bob',))
    p.start()
    p.join()
"#, None,None).unwrap();
}
Output (continues until Ctrl-C):
start
start
start
start
start
start
start
start
EDIT
As mentioned in the comments below, I gave up on trying to create processes from the Python code. The interactions between Windows, the Python multiprocessing module, and how processes are created with Rust are too obscure to manage properly.
So instead I will create and manage them from Rust. The code is therefore more textbook:
use std::process::Command;
fn main() {
    let mut cmd = Command::new("python");
    cmd.args(&["-c", "print('test')"]);
    let process = cmd.spawn().expect("Couldn't spawn process.");
    println!("{:?}", process.wait_with_output().unwrap());
}
I can't reproduce this; for me it just prints start and then hello bob as expected. For whatever reason, it seems that in your case, __name__ is always equal to "__main__" and you get this infinite recursion. I'm using the cpython crate version v0.4.1 and Python 3.8.1 on Arch Linux.
A workaround is to not depend on __name__ at all, but to instead define your Python code as a module with a main() function and then call that function:
use cpython::{Python, PyModule};
fn main() {
    let gil = Python::acquire_gil();
    let py = gil.python();
    let module = PyModule::new(py, "bob").unwrap();
    py.run(r#"
import sys
from multiprocessing import Process
def f(name):
    print('hello', name)
def main():
    print('start')
    sys.argv=['']
    p = Process(target=f, args=('bob',))
    p.start()
    p.join()
"#, Some(&module.dict(py)), None).unwrap();
    module.call(py, "main", cpython::NoArgs, None).unwrap();
}

How to use multiprocessing.Pool in an imported module?

I have not been able to implement the suggestion here: Applying two functions to two lists simultaneously.
I guess it is because the module is imported by another module and thus my Windows spawns multiple python processes?
My question is: how can I use the code below without the if __name__ == "__main__":
args_m = [(mortality_men, my_agents, graveyard, families, firms, year, agent) for agent in males]
args_f = [(mortality_women, fertility, year, families, my_agents, graveyard, firms, agent) for agent in females]
with mp.Pool(processes=(mp.cpu_count() - 1)) as p:
    p.map_async(process_males, args_m)
    p.map_async(process_females, args_f)
Both process_males and process_females are functions.
args_m and args_f are lists of argument tuples.
Also, I don't need to return anything. Agents are class instances that need updating.
The reason you need to guard multiprocessing code in an if __name__ == "__main__" is that you don't want it to run again in the child process. That can happen on Windows, where the interpreter needs to reload all of its state since there's no fork system call that will copy the parent process's address space. But you only need to use it where code is supposed to be running at the top level since you're in the main script. It's not the only way to guard your code.
In your specific case, I think you should put the multiprocessing code in a function. That won't run in the child process, as long as nothing else calls the function when it should not. Your main module can import the module, then call the function (from within an if __name__ == "__main__" block, probably).
It should be something like this:
some_module.py:
import multiprocessing as mp
def process_males(x):
    ...
def process_females(x):
    ...
args_m = [...] # these could be defined inside the function below if that makes more sense
args_f = [...]
def do_stuff():
    with mp.Pool(processes=(mp.cpu_count() - 1)) as p:
        males = p.map_async(process_males, args_m)
        females = p.map_async(process_females, args_f)
        # wait before leaving the with block, which terminates the pool
        males.wait()
        females.wait()
main.py:
import some_module
if __name__ == "__main__":
    some_module.do_stuff()
In your real code you might want to pass some arguments or get a return value from do_stuff (which should also be given a more descriptive name than the generic one I've used in this example).
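For anyone who wants to try this layout without touching their own project, here is a self-contained sketch. It writes a hypothetical some_module.py and main.py into a temporary directory and runs main.py in a separate interpreter. The square function and the values passed in are invented for the demo, and the spawn start method is forced so that Linux and macOS behave the way Windows does.

```python
import subprocess
import sys
import tempfile
import textwrap
from pathlib import Path

# Hypothetical library module: nothing here runs at import time.
LIBRARY = textwrap.dedent('''
    import multiprocessing as mp

    def square(x):
        return x * x

    def do_stuff(values):
        # "spawn" is forced so that every OS behaves like Windows here.
        ctx = mp.get_context("spawn")
        with ctx.Pool(processes=2) as pool:
            return pool.map(square, values)  # blocks until all results are in
''')

# Hypothetical client script: the guard is the only thing the user has to write.
CLIENT = textwrap.dedent('''
    import some_module

    if __name__ == "__main__":
        print(some_module.do_stuff([1, 2, 3]))
''')

def run_client():
    """Write both files to a temporary directory and run the client script."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "some_module.py").write_text(LIBRARY)
        main_path = Path(tmp, "main.py")
        main_path.write_text(CLIENT)
        done = subprocess.run([sys.executable, str(main_path)], cwd=tmp,
                              capture_output=True, text=True, timeout=120)
        return done.stdout.strip()

client_output = run_client()
print(client_output)
```

The sketch uses the blocking pool.map so that do_stuff returns only after the workers have finished, and the worker function lives in the imported module, so the children can find it by importing some_module.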
The idea of if __name__ == '__main__': is to avoid infinite process spawning.
When pickling a function defined in your main script, Python has to figure out what part of your main script is the function code. It will basically re-run your script. If your code creating the Pool is in the same script and not protected by the "if main", then by trying to import the function, you will try to launch another Pool that will try to launch another Pool...
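A small sketch of why the import is needed at all: pickle does not serialize a function's code, only a reference made of the module name and the function name, so the receiving process must import that module to resolve it. The example uses math.sqrt purely as a convenient importable function.

```python
import math
import pickle

# Pickling a function stores where to find it, not its code.
payload = pickle.dumps(math.sqrt)
print(payload)  # a short byte string that mentions "math" and "sqrt"

# Unpickling resolves the reference by importing the module and looking up the name.
restored = pickle.loads(payload)
print(restored is math.sqrt)

# A lambda has no importable name, so there is nothing to reference.
try:
    pickle.dumps(lambda x: x * x)
    lambda_picklable = True
except (pickle.PicklingError, AttributeError):
    lambda_picklable = False
print("lambda picklable:", lambda_picklable)
```

For a function defined in the main script, "importing the module" in the child means executing that script again, which is where an unguarded Pool creation goes wrong.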
Thus you should separate the function definitions from the actual main script:
from multiprocessing import Pool
# define test functions outside main
# so they can be imported without launching
# a new Pool
def test_func():
    pass
if __name__ == '__main__':
    with Pool(4) as p:
        r = p.apply_async(test_func)
        # ... do stuff
        result = r.get()
Cannot yet comment on the question, but a workaround I have used that some have mentioned is just to define the process_males etc. functions in a module that is different to where the processes are spawned. Then import the module containing the multiprocessing spawns.
I solved it by calling the module's multiprocessing function within the if __name__ == "__main__": block of the main script, since the function that involves multiprocessing is the last step in my module. Others could try this if applicable.

can't multiprocess user defined code - cannot pickle

I tried using multiprocessing on my task, which generally means doing some calculation and then passing back the result. The problem is that the code defining the calculation is defined by the user; it is compiled from a string before execution. This works perfectly using exec(), eval() or compile() etc. when run in the main process. The example below works only for the direct call but not for the Process. I get "Can't pickle class 'code'". Is there any way round this? For example, using multiprocessing differently? Or using another package? Or some more low-level stuff? Unfortunately, passing the string to the process and then compiling inside the process is not an option for me because of the design of the whole application (i.e. the code string is 'lost' and only the compiled version is available).
import multiprocessing
def callf(f, a):
    exec(f, {'a': a})
if __name__ == "__main__":
    f = compile("print(a)", filename="<string>", mode="exec")
    callf(f, 10) # this works
    process = multiprocessing.Process(target=callf, args=(f, 20)) # this does not work
    process.start()
    process.join()
UPDATE: here is another attempt, which is actually closer to my actual need. It results in a different error message, but the function still cannot be pickled.
import multiprocessing
if __name__ == "__main__":
    source = "def f(): print('done')"
    locals = dict()
    exec(source, {}, locals)
    f = locals['f']
    f() # this works
    process = multiprocessing.Process(target=f) # this does not work
    process.start()
    process.join()
pickle can't serialize code objects but dill can. There is a dill-based fork of multiprocessing called multiprocessing_on_dill but I have no idea how good it is. You could also just dill-encode the code object to make standard multiprocessing happy.
import multiprocessing
import dill
def callf_dilled(f_dilled, a):
    return callf(dill.loads(f_dilled), a)
def callf(f, a):
    exec(f, {'a': a})
if __name__ == "__main__":
    f = compile("print(a)", filename="<string>", mode="exec")
    callf(f, 10) # this works
    process = multiprocessing.Process(target=callf_dilled,
                                      args=(dill.dumps(f), 20)) # now this works too!
    process.start()
    process.join()
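If adding dill as a dependency is not possible, the same "encode the code object to bytes first" idea can be sketched with the standard library's marshal module instead. This is a swapped-in technique, not what the answer above uses, and it comes with caveats: the marshal format is specific to the Python version, it must not be used on untrusted data, and rebuilding a function this way drops defaults and closures. The helper names run_blob and rebuild_function are hypothetical.

```python
import builtins
import marshal
import pickle
import types

code = compile("result = a * 2", filename="<string>", mode="exec")

# Plain pickle refuses code objects, which is what multiprocessing trips over.
try:
    pickle.dumps(code)
    code_picklable = True
except (TypeError, pickle.PicklingError):
    code_picklable = False

# marshal turns the code object into bytes, and bytes pickle without trouble,
# so the blob could be passed through Process(args=...).
blob = marshal.dumps(code)

def run_blob(code_blob, a):
    """What a worker would do: decode the code object and execute it."""
    namespace = {"a": a}
    exec(marshal.loads(code_blob), namespace)
    return namespace["result"]

print(run_blob(blob, 21))

# For the "function built by exec" case, ship f.__code__ and rebuild the function.
scope = {}
exec("def f(x):\n    return x + 1", scope)
f_blob = marshal.dumps(scope["f"].__code__)

def rebuild_function(code_blob):
    """Recreate a simple function (no closure, no defaults) from marshalled code."""
    return types.FunctionType(marshal.loads(code_blob), {"__builtins__": builtins})

print(rebuild_function(f_blob)(1))
```

In a real program, run_blob or a wrapper around rebuild_function would be the Process target, defined at module level so the child can import it.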
