I use Rust to speed up a data processing pipeline, but I have to run some existing Python code as-is, which I want to parallelize. Following discussion in another question, creating multiple Python processes is a possible approach given my project's specific constraints. However, running the code below gives an infinite loop. I can't quite understand why.
use cpython::Python;

fn main() {
    let gil = Python::acquire_gil();
    let py = gil.python();
    py.run(r#"
import sys
from multiprocessing import Process

def f(name):
    print('hello', name)

if __name__ == '__main__':
    print('start')
    sys.argv = ['']
    p = Process(target=f, args=('bob',))
    p.start()
    p.join()
"#, None, None).unwrap();
}
Output (continues until Ctrl-C):
start
start
start
start
start
start
start
start
EDIT
As mentioned in the comments below, I gave up on trying to create processes from the Python code. The interaction between Windows, the Python multiprocessing module, and the way processes are created from Rust is too obscure to manage properly.
So instead I will create and manage them from Rust. The code is therefore more textbook:
use std::process::Command;

fn main() {
    let mut cmd = Command::new("python");
    cmd.args(&["-c", "print('test')"]);
    let process = cmd.spawn().expect("Couldn't spawn process.");
    println!("{:?}", process.wait_with_output().unwrap());
}
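If the Python side eventually needs its own worker processes, it can now use the standard multiprocessing idiom, since it runs as a real top-level script rather than embedded code. A minimal sketch of such a script (hypothetical worker.py, passed to the spawned interpreter instead of the -c one-liner):

# worker.py -- hypothetical script run by the Python interpreter spawned from Rust
from multiprocessing import Process

def f(name):
    print('hello', name)

if __name__ == '__main__':
    # A real top-level script: the spawned child re-imports this file
    # without re-running the guarded block, so no infinite loop.
    p = Process(target=f, args=('bob',))
    p.start()
    p.join()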
I can't reproduce this; for me it just prints start and then hello bob as expected. For whatever reason, it seems that in your case, __name__ is always equal to "__main__" and you get this infinite recursion. I'm using the cpython crate version v0.4.1 and Python 3.8.1 on Arch Linux.
A workaround is to not depend on __name__ at all, but to instead define your Python code as a module with a main() function and then call that function:
use cpython::{Python, PyModule};

fn main() {
    let gil = Python::acquire_gil();
    let py = gil.python();

    let module = PyModule::new(py, "bob").unwrap();
    py.run(r#"
import sys
from multiprocessing import Process

def f(name):
    print('hello', name)

def main():
    print('start')
    sys.argv = ['']
    p = Process(target=f, args=('bob',))
    p.start()
    p.join()
"#, Some(&module.dict(py)), None).unwrap();

    module.call(py, "main", cpython::NoArgs, None).unwrap();
}
Related
I have a big piece of code that takes a while to run its calculation. I decided to learn about multithreading and multiprocessing because only 20% of my processor was being used for the calculation. After getting no improvement with multithreading, I decided to try multiprocessing, but whenever I try to use it, it just shows a lot of errors, even on very simple code.
This is the code that I tested after I started having problems with my big, calculation-heavy code:
from concurrent.futures import ProcessPoolExecutor

def func():
    print("done")

def func_():
    print("done")

def main():
    executor = ProcessPoolExecutor(max_workers=3)
    p1 = executor.submit(func)
    p2 = executor.submit(func_)

main()
The error message that I am getting says:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
This is not the whole message because it is very long, but I think this part is the helpful bit; pretty much everything else in the error message is just "error at line ... in ...".
If it helps, the big code is at: https://github.com/nobody48sheldor/fuseeinator2.0
It might not be the latest version.
I updated your code to show main being called. This is an issue with operating systems that spawn processes, like Windows. To test on my Linux machine I had to add a bit of code, but this then crashes on my machine too:
# Test code to make Linux spawn like Windows and generate the error.
# This code is not needed on Windows.
if __name__ == "__main__":
    import multiprocessing as mp
    mp.freeze_support()
    mp.set_start_method('spawn')

# test script
from concurrent.futures import ProcessPoolExecutor

def func():
    print("done")

def func_():
    print("done")

def main():
    executor = ProcessPoolExecutor(max_workers=3)
    p1 = executor.submit(func)
    p2 = executor.submit(func_)

main()
In a spawning system, Python can't just fork into a new execution context. Instead, it runs a new instance of the Python interpreter, imports the module, and pickles/unpickles enough state to make a child execution environment. This can be a very heavy operation.
But your script is not import safe. Since main() is called at module level, the import in the child would run main again. That would create a grandchild subprocess which runs main again (and so on, until you hang your machine). Python detects this infinite loop and displays the message instead.
Top-level scripts are always named "__main__". Put all of the code that should only run once at script level inside an if __name__ == "__main__" block. Then, if the module is imported, nothing harmful is run.
if __name__ == "__main__":
main()
and the script will work.
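For reference, here is a minimal import-safe version of the test script put together from the pieces above (the set_start_method call is only there to imitate Windows on Linux; spawn is already the default on Windows):

from concurrent.futures import ProcessPoolExecutor

def func():
    print("done")

def func_():
    print("done")

def main():
    executor = ProcessPoolExecutor(max_workers=3)
    p1 = executor.submit(func)
    p2 = executor.submit(func_)

if __name__ == "__main__":
    # Only needed to imitate Windows on Linux; can be omitted on Windows.
    import multiprocessing as mp
    mp.set_start_method('spawn')
    main()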
There are code analyzers out there that import modules to extract doc strings, or other useful stuff. Your code shouldn't fire the missiles just because some tool did an import.
Another way to solve the problem is to move everything multiprocessing related out of the script and into a module. Suppose I had a module with your code in it
whatever.py
from concurrent.futures import ProcessPoolExecutor

def func():
    print("done")

def func_():
    print("done")

def main():
    executor = ProcessPoolExecutor(max_workers=3)
    p1 = executor.submit(func)
    p2 = executor.submit(func_)
myscript.py
#!/usr/bin/env python3
import whatever

whatever.main()
Now, since the pool is already in an imported module that doesn't do this crazy restart-itself thing, no if __name__ == "__main__": is necessary. It's a good idea to put it in myscript.py anyway, but it's not required.
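For example, the guarded variant of myscript.py would simply be:

#!/usr/bin/env python3
import whatever

if __name__ == "__main__":
    whatever.main()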
I am trying to run a loop with different threads to speed up the process, and I haven't found another way to do it except to create another script, call it with subprocess, and pass it the big array as an argument.
I've put comments in the code to explain the problem.
The script I am trying to call with subprocess:
import multiprocessing
from joblib import Parallel, delayed
import sys

num_cores = multiprocessing.cpu_count()
inputs = sys.argv[1:]
print(inputs)

def evaluate_individual(ind):
    ind.evaluate()
    ind.already_evaluate = True

if __name__ == "__main__":
    # I am trying to execute a loop with multiple threads,
    # but it needs to be in "__main__", which I can't do in my main script,
    # so I created this external script...
    Parallel(n_jobs=num_cores)(delayed(evaluate_individual)(i) for i in inputs)
The script that calls the other script:
import subprocess, sys
from clean import IndividualExamples

# the big array
inputs = []
for i in range(200):
    ind = IndividualExamples.simple_individual((None, None), None)
    inputs.append(ind)

# And now I need to call the other script...
# But I pass a big array as an argument and I don't know if there is a better method.
# AND... this call to the subprocess doesn't work and returns no error, so I don't know what to do...
subprocess.Popen(['py', 'C:/Users/alexa/OneDrive/Documents/Programmes/Neronal Network/multi_thread.py', str(inputs)])
Thanks for your help. If you know another way to run a loop with multiple threads inside a function (not in the main module), tell me as well.
And sorry for my approximate English.
Edit: I tried with a pool, but same problem: I need to put it in "__main__", so I can't put it in a function in my script (which is how I need to use it).
Modified code with a pool:
if __name__ == "__main__":
tps1 = time.time()
with multiprocessing.Pool(12) as p:
outputs = p.map(evaluate_individual, inputs)
tps2 = time.time()
print(tps2 - tps1)
New_array = outputs
Another little question: I tried a simple loop and the multiprocessing Pool and compared the times of the two approaches:
simple loop: 0.022951126098632812
multiprocessing: 0.9151067733764648
Why can multiprocessing on 12 cores be slower than a simple loop?
You can do it using a multiprocessing.Pool
import subprocess, sys
from multiprocessing import Pool
from clean import IndividualExamples

def evaluate_individual(ind):
    ind.evaluate()

# the big array
pool = Pool()
for i in range(200):
    ind = IndividualExamples.simple_individual((None, None), None)
    pool.apply_async(func=evaluate_individual, args=(ind,))

pool.close()
pool.join()
Thanks, I just tried it, but this raises a runtime error:
raise RuntimeError('''
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
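That is the same bootstrapping error discussed in the earlier answers: on Windows, the pool has to be created from code that is only reached through the if __name__ == '__main__' guard. A minimal sketch of the answer's code with that guard added, assuming it lives in the main script:

from multiprocessing import Pool
from clean import IndividualExamples

def evaluate_individual(ind):
    ind.evaluate()

def main():
    pool = Pool()
    for i in range(200):
        ind = IndividualExamples.simple_individual((None, None), None)
        pool.apply_async(func=evaluate_individual, args=(ind,))
    pool.close()
    pool.join()

if __name__ == "__main__":
    # Guarded entry point: the spawned children re-import this file
    # without re-running main(), so no bootstrapping error.
    main()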
Last year I wrote a simple code to mosaic hundreds of raster data. It went well.
Last week I picked up this code to do the same work, but it didn't work.
I copied the official demo:
from multiprocessing import Process

def f(name):
    print 'hello', name

if __name__ == '__main__':
    p = Process(target=f, args=('bob',))
    p.start()
    p.join()
But there is no output and no error report. If I switch from the multiprocessing module to the threading module, it works fine.
Then I found out that the demo can be run in the Python console but not in the IPython console.
PS: I use WinPython 64-bit and have tried the 2.7 and 3.5 versions; both have the same problem. I also cannot use the multiprocessing module in the ArcGIS Python console.
Thanks
I don't fully understand your question, but if you want to use threads, you can try this:
from multiprocessing.dummy import Pool as ThreadPool

def f(name):
    print 'hello', name

if __name__ == '__main__':
    pool = ThreadPool(4)  # 4 is the number of your CPU cores, up to 4 times faster
    names = ['bob', 'jack', 'newt']  # prepare your name list
    results = pool.map(f, names)
    pool.close()
    pool.join()
In Python 3, I am trying to run the same function with multiple arguments at the same time. I am using multiprocessing in Python 3.5.2, Anaconda 4.1.1 (64-bit), on Windows 7. I am getting the following error regarding spawn.py:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
My code is:
from multiprocessing import Process

# Function to be run
def savedatetime(password, rangestart, rangeend):
    print((password, rangestart, rangeend))

# Inputs to the function
passwords = ['p0', 'p1', 'p2']
ranges = [(1, 10), (10, 20), (20, 30)]

# Creating the list of processes to append
fileind = 0
proc = []
for rangeind in range(0, len(passwords)):
    password = passwords[fileind]
    rangestart = ranges[fileind][0]
    rangeend = ranges[fileind][1]
    p = Process(target=savedatetime, args=(password, rangestart, rangeend))
    proc.append(p)
    fileind = fileind + 1

# running sequentially to check that the function works
savedatetime(password, rangestart, rangeend)
print(proc)

# Attempting to run simultaneously. This is where the error occurs:
for p in proc:
    p.start()
    p.join()
Could you please help me to fix my code so the multiple instances of the same function run simultaneously? Thanks!
You need a gatekeeper test to prevent your code from executing again when Windows simulates forking by reimporting your main module. It's why all main modules should control their "main-like" behavior with the if __name__ == '__main__': gatekeeper check.
The simplest solution is to take all the "main" code, indent it one level, and define a function named main that contains it.
The main function would be:
def main():
    # Inputs to the function
    passwords = ['p0', 'p1', 'p2']
    ranges = [(1, 10), (10, 20), (20, 30)]
    # Creating the list of processes to append
    ... omitted for brevity ...
    # Attempting to run simultaneously. This is where the error occurs:
    for p in proc:
        p.start()
    # Join after all processes started, so you actually get parallelism
    for p in proc:
        p.join()
Then just add the following to the end of your file:
if __name__ == '__main__':
    main()
The __name__ check prevents your main function from being re-executed when you spawn the worker processes.
I'm currently going through some pre-existing code with the goal of speeding it up. There are a few places that are extremely good candidates for parallelization. Since Python has the GIL, I thought I'd use the multiprocessing module.
However, from my understanding, the only way this will work on Windows is if I call the function that needs multiple processes from the highest-level script inside the if __name__ == '__main__' safeguard. However, this particular program is meant to be distributed and imported as a module, so it'd be kind of clunky to have the user copy and paste that safeguard, and that's something I'd really like to avoid.
Am I out of luck or misunderstanding something as far as multiprocessing goes? Or is there any other way to do it with Windows?
For everyone still searching:
Inside the module (data.py):
from multiprocessing import Process

def printing(a):
    print(a)

def foo(name):
    var = {"process": {}}
    if name == "__main__":
        for i in range(10):
            # note the trailing comma: args must be a tuple
            var["process"][i] = Process(target=printing, args=(str(i),))
            var["process"][i].start()
        for i in range(10):
            var["process"][i].join()
Inside main.py:
import data
name = __name__
data.foo(name)
output:
>>2
>>6
>>0
>>4
>>8
>>3
>>1
>>9
>>5
>>7
I am a complete noob so please don't judge the coding OR presentation but at least it works.
As explained in comments, perhaps you could do something like
# client_main.py
from mylib.mpSentinel import MPSentinel

# client logic
if __name__ == "__main__":
    MPSentinel.As_master()

# mpsentinel.py
class MPSentinel(object):
    _is_master = False

    @classmethod
    def As_master(cls):
        cls._is_master = True

    @classmethod
    def Is_master(cls):
        return cls._is_master
It's not ideal in that it's effectively a singleton/global, but it would work around Windows' lack of fork. Still, you could use MPSentinel.Is_master() to use multiprocessing optionally, and it should prevent Windows from process-bombing.
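Library code could then consult the sentinel to decide whether to parallelize. A hypothetical sketch (process_all and work are made-up names, not part of the original code):

# somewhere inside the library (hypothetical example)
from multiprocessing import Pool
from mylib.mpSentinel import MPSentinel

def work(item):
    return item * 2

def process_all(items):
    if MPSentinel.Is_master():
        # Started from a guarded entry point, so spawning workers is safe.
        with Pool() as pool:
            return pool.map(work, items)
    # Imported without the guard (or inside a spawned child): stay serial.
    return [work(item) for item in items]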
On ms-windows, you should be able to import the main module of a program without side effects like starting a process.
When Python imports a module, it actually runs it.
So one way of doing that is to put the code in the if __name__ == '__main__' block.
Another way is to do it from within a function.
The following won't work on ms-windows:
from multiprocessing import Process

def foo():
    print('hello')

p = Process(target=foo)
p.start()
This is because it tries to start a process when importing the module.
The following example from the programming guidelines is OK:
from multiprocessing import Process, freeze_support, set_start_method
def foo():
print('hello')
if __name__ == '__main__':
freeze_support()
set_start_method('spawn')
p = Process(target=foo)
p.start()
Because the code in the if block doesn't run when the module is imported.
But putting it in a function should also work:
from multiprocessing import Process

def foo():
    print('hello')

def bar():
    p = Process(target=foo)
    p.start()
When this module is run, it will define two new functions, not run them.
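A caller would then start the process from a safe place, for example (the file names run_bar.py and mymodule.py are made up here; assume the module above is saved as mymodule.py):

# run_bar.py -- hypothetical caller of the module above
import mymodule

if __name__ == '__main__':
    mymodule.bar()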
I've been developing an Instagram image scraper, so in order to make the download & save operations run faster I've implemented multiprocessing in an auxiliary module. Note that this code is inside an auxiliary module and not inside the main module.
The solution I found is adding this line:
if __name__ != '__main__':
Pretty simple, but it actually works!
import multiprocessing
import requests

def multi_proces(urls, profile):
    img_saved = 0
    if __name__ != '__main__':  # line needed for the sake of getting this NOT to crash
        processes = []
        for url in urls:
            try:
                process = multiprocessing.Process(target=download_save, args=[url, profile, img_saved])
                processes.append(process)
                img_saved += 1
            except:
                continue
        for proce in processes:
            proce.start()
        for proce in processes:
            proce.join()
    return img_saved

def download_save(url, profile, img_saved):
    file = requests.get(url, allow_redirects=True)  # Download
    open(f"scraped_data\{profile}\{profile}-{img_saved}.jpg", 'wb').write(file.content)  # Save