In Python 3, I am trying to run the same function with multiple arguments at the same time. I am using multiprocessing in Python 3.5.2, Anaconda 4.1.1 (64-bit), on Windows 7. I am getting the following error regarding spawn.py:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
My code is:
from multiprocessing import Process

# Function to be run
def savedatetime(password, rangestart, rangeend):
    print((password, rangestart, rangeend))

# Inputs to the function
passwords = ['p0', 'p1', 'p2']
ranges = [(1, 10), (10, 20), (20, 30)]

# Creating the list of processes to append
fileind = 0
proc = []
for rangeind in range(0, len(passwords)):
    password = passwords[fileind]
    rangestart = ranges[fileind][0]
    rangeend = ranges[fileind][1]
    p = Process(target=savedatetime, args=(password, rangestart, rangeend))
    proc.append(p)
    fileind = fileind + 1

# running sequentially to check that the function works
savedatetime(password, rangestart, rangeend)
print(proc)

# Attempting to run simultaneously. This is where the error occurs:
for p in proc:
    p.start()
    p.join()
Could you please help me to fix my code so the multiple instances of the same function run simultaneously? Thanks!
You need a gatekeeper test to prevent your code from executing again when Windows simulates forking by re-importing your main module. That's why every main module should control its "main-like" behavior with the if __name__ == '__main__': gatekeeper check.
The simplest solution is to take all the "main" code, indent it one level, and define a function named main that contains it.
The main function would be:
def main():
    # Inputs to the function
    passwords = ['p0', 'p1', 'p2']
    ranges = [(1, 10), (10, 20), (20, 30)]
    # Creating the list of processes to append
    ... omitted for brevity ...
    # Attempting to run simultaneously. This is where the error occurs:
    for p in proc:
        p.start()
    # Join after all processes started, so you actually get parallelism
    for p in proc:
        p.join()
Then just add the following to the end of your file:
if __name__ == '__main__':
    main()
The __name__ check prevents your main function from being re-executed when you spawn the worker processes.
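For completeness, here is a minimal runnable sketch of the whole script with the guard in place (the zip over passwords and ranges replaces the manual fileind bookkeeping; that is purely an editorial simplification):

from multiprocessing import Process

def savedatetime(password, rangestart, rangeend):
    print((password, rangestart, rangeend))

def main():
    passwords = ['p0', 'p1', 'p2']
    ranges = [(1, 10), (10, 20), (20, 30)]
    # one Process per (password, range) pair
    proc = [Process(target=savedatetime, args=(pw, r[0], r[1]))
            for pw, r in zip(passwords, ranges)]
    for p in proc:
        p.start()   # start them all first...
    for p in proc:
        p.join()    # ...then wait for all of them

if __name__ == '__main__':
    main()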
Related
I am trying to run a separate Python Process and store the result in the queue. I can extract the result in two ways: either run queue.get() just once, or use a while loop and iterate over the queue until it's empty.
In the code below, the first method is used if first=True and the second method is used if first=False.
from multiprocessing import Process, Queue

def foo1(queue):
    queue.put(1)

def main(first=False):
    queue = Queue()
    p = Process(target=foo1, args=(queue,))
    p.start()
    if first:
        a = queue.get()
        print(a)
    else:
        while not queue.empty():
            print(queue.get())
    p.join()

if __name__ == "__main__":
    main()
Question: Why does the first method print 1 correctly while the second does not? Aren't they supposed to be equivalent?
I am using Windows 10. I noticed this behavior in both interactive console and shell terminal.
Note: Due to the bug mentioned here I have to run the code as one script.
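For illustration, here is a minimal sketch (my own, with an artificial delay added to exaggerate the timing) showing why the while not queue.empty() approach can miss the item: right after p.start() the child has usually not called put() yet, so empty() still reports True and the loop body never runs.

from multiprocessing import Process, Queue
import time

def foo1(queue):
    time.sleep(1)              # exaggerate the child's startup delay
    queue.put(1)

if __name__ == "__main__":
    queue = Queue()
    p = Process(target=foo1, args=(queue,))
    p.start()
    # The child has almost certainly not called put() yet, so this is very
    # likely to print True; a `while not queue.empty()` loop would therefore
    # exit immediately without printing anything.
    print(queue.empty())
    print(queue.get())         # blocks until the child actually puts the item
    p.join()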
I use Rust to speed up a data processing pipeline, but I have to run some existing Python code as-is, which I want to parallelize. Following discussion in another question, creating multiple Python processes is a possible approach given my project's specific constraints. However, running the code below gives an infinite loop. I can't quite understand why.
use cpython::Python;

fn main() {
    let gil = Python::acquire_gil();
    let py = gil.python();
    py.run(r#"
import sys
from multiprocessing import Process

def f(name):
    print('hello', name)

if __name__ == '__main__':
    print('start')
    sys.argv = ['']
    p = Process(target=f, args=('bob',))
    p.start()
    p.join()
    "#, None, None).unwrap();
}
Output (continues until Ctrl-C):
start
start
start
start
start
start
start
start
EDIT
As mentioned in the comments below, I gave up on trying to create processes from the Python code. The interference between Windows, the Python multiprocessing module, and how processes are created with Rust is too obscure to manage properly.
So instead I will create and manage them from Rust. The code is therefore more textbook:
use std::process::Command;

fn main() {
    let mut cmd = Command::new("python");
    cmd.args(&["-c", "print('test')"]);
    let process = cmd.spawn().expect("Couldn't spawn process.");
    println!("{:?}", process.wait_with_output().unwrap());
}
I can't reproduce this; for me it just prints start and then hello bob as expected. For whatever reason, it seems that in your case, __name__ is always equal to "__main__" and you get this infinite recursion. I'm using the cpython crate v0.4.1 and Python 3.8.1 on Arch Linux.
A workaround is to not depend on __name__ at all, but to instead define your Python code as a module with a main() function and then call that function:
use cpython::{Python, PyModule};

fn main() {
    let gil = Python::acquire_gil();
    let py = gil.python();
    let module = PyModule::new(py, "bob").unwrap();
    py.run(r#"
import sys
from multiprocessing import Process

def f(name):
    print('hello', name)

def main():
    print('start')
    sys.argv = ['']
    p = Process(target=f, args=('bob',))
    p.start()
    p.join()
    "#, Some(&module.dict(py)), None).unwrap();
    module.call(py, "main", cpython::NoArgs, None).unwrap();
}
import random
import os
from multiprocessing import Process

num = random.randint(0, 100)

def show_num():
    print("pid:{}, num is {}".format(os.getpid(), num))

if __name__ == '__main__':
    print("pid:{}, num is {}".format(os.getpid(), num))
    p = Process(target=show_num)
    p.start()
    p.join()
    print('Parent Process Stop')
The above code shows the basic usage of creating a process. If I run this script in a Windows environment, the variable num is different in the parent process and the child process. However, num is the same in both processes when the script runs in a Linux environment.
I understand that their mechanisms for creating processes are different. For example, the Windows system doesn't have the fork method.
But can someone give me a more detailed explanation of their difference?
Thank you very much.
The difference explaining the behavior described in your post is exactly what you mentioned: the start method used for creating the process. On Unix-style OSs, the default is fork. On Windows, the only available option is spawn.
fork
As described in the Overview section of this Wiki page (in a slightly different order):
The fork operation creates a separate address space for the child. The
child process has an exact copy of all the memory segments of the
parent process.
The child process calls the exec system call to overlay itself with the
other program: it ceases execution of its former program in favor of
the other.
This means that, when using fork, the child process already has the variable num in its address space and uses it. random.randint(0, 100) is not called again.
spawn
As the multiprocessing docs describe:
The parent process starts a fresh python interpreter process.
In this fresh interpreter process, the module from which the child is spawned is executed. Oversimplified, this does python.exe your_script.py a second time. Hence, a new variable num is created in the child process by assigning the return value of another call to random.randint(0, 100) to it. It is therefore very likely that the content of num differs between the processes.
This is, by the way, also the reason why you absolutely need to safeguard the instantiation and start of a process with the if __name__ == '__main__' idiom when using spawn as the start method, otherwise you end up with:
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
You can use spawn on POSIX OSs as well, to mimic the behavior you have seen on Windows:
import random
import os
from multiprocessing import Process, set_start_method
import platform

num = random.randint(0, 100)

def show_num():
    print("pid:{}, num is {}".format(os.getpid(), num))

if __name__ == '__main__':
    print(platform.system())
    # change the start method for new processes to spawn
    set_start_method("spawn")
    print("pid:{}, num is {}".format(os.getpid(), num))
    p = Process(target=show_num)
    p.start()
    p.join()
    print('Parent Process Stop')
Output:
Linux
pid:26835, num is 41
pid:26839, num is 13
Parent Process Stop
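As an aside (not part of the original answer): if the child must see the same value regardless of the start method, pass the value explicitly as an argument instead of relying on module-level state. A small sketch:

import random
import os
from multiprocessing import Process

def show_num(num):
    print("pid:{}, num is {}".format(os.getpid(), num))

if __name__ == '__main__':
    num = random.randint(0, 100)    # generated once, in the parent only
    print("pid:{}, num is {}".format(os.getpid(), num))
    # the value is pickled and sent to the child, so both processes agree
    p = Process(target=show_num, args=(num,))
    p.start()
    p.join()
    print('Parent Process Stop')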
I have created a (rather large) program that takes quite a long time to finish, and I started looking into ways to speed up the program.
I found that if I open task manager while the program is running only one core is being used.
After some research, I found this website:
Why does multiprocessing use only a single core after I import numpy?, which gives a solution of os.system("taskset -p 0xff %d" % os.getpid());
however, this doesn't work for me, and my program continues to run on a single core.
I then found this:
is python capable of running on multiple cores?,
which pointed towards using multiprocessing.
So after looking into multiprocessing, I came across this documentation on how to use it: https://docs.python.org/3/library/multiprocessing.html#examples
I tried the code:
from multiprocessing import Process

def f(name):
    print('hello', name)

if __name__ == '__main__':
    p = Process(target=f, args=('bob',))
    p.start()
    p.join()

a = input("Finished")
After running the code (not in IDLE), it said this:
Finished
hello bob
Finished
Note: after it said Finished the first time, I pressed Enter.
So after this I am now even more confused, and I have two questions:
First: It still doesn't run on multiple cores (I have an 8-core Intel i7).
Second: Why does it print "Finished" before it has even run the code in the if statement (and it's not even finished yet!)?
To answer your second question first, "Finished" is printed to the terminal because a = input("Finished") is outside of your if __name__ == '__main__': code block. It therefore runs at module level, so the spawned child executes it when it re-imports your module (before f is ever called), and the parent executes it again once the guarded block has finished.
To answer the first question, you only created one process, which you start and then wait to complete before continuing. This gives you none of the benefits of multiprocessing while still incurring the overhead of creating the new process.
Because you want to create several processes, you need to create a pool via a collection of some sort (e.g. a Python list) and then start all of the processes.
In practice, you need to be concerned with more than the number of processors (such as the amount of available memory, the ability to restart workers that crash, etc.). However, here is a simple example that completes your task above.
import datetime as dt
from multiprocessing import Process, current_process
import sys

def f(name):
    print('{}: hello {} from {}'.format(
        dt.datetime.now(), name, current_process().name))
    sys.stdout.flush()

if __name__ == '__main__':
    worker_count = 8
    worker_pool = []
    for _ in range(worker_count):
        p = Process(target=f, args=('bob',))
        p.start()
        worker_pool.append(p)
    for p in worker_pool:
        p.join()  # Wait for all of the workers to finish.

    # Allow time to view results before program terminates.
    a = input("Finished")  # raw_input(...) in Python 2.
Also note that if you join workers immediately after starting them, you are waiting for each worker to complete its task before starting the next worker. This is generally undesirable unless the ordering of the tasks must be sequential.
Typically Wrong
worker_1.start()
worker_1.join()
worker_2.start() # Must wait for worker_1 to complete before starting worker_2.
worker_2.join()
Usually Desired
worker_1.start()
worker_2.start() # Start all workers.
worker_1.join()
worker_2.join() # Wait for all workers to finish.
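If every worker runs the same function, a multiprocessing.Pool can take care of the start/join bookkeeping for you. A rough sketch of an equivalent to the example above (same f, eight tasks; this is an alternative, not a rewrite of the code above):

from multiprocessing import Pool, current_process

def f(name):
    # each call may run in a different worker process
    return 'hello {} from {}'.format(name, current_process().name)

if __name__ == '__main__':
    with Pool(processes=8) as pool:
        # map blocks until every task has completed in the workers
        for line in pool.map(f, ['bob'] * 8):
            print(line)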
For more information, please refer to the following links:
https://docs.python.org/3/library/multiprocessing.html
Dead simple example of using Multiprocessing Queue, Pool and Locking
https://pymotw.com/2/multiprocessing/basics.html
https://pymotw.com/2/multiprocessing/communication.html
https://pymotw.com/2/multiprocessing/mapreduce.html
I am using multiprocessing to speed up my program, and there is an enigma I cannot solve.
I am using multiprocessing to write a lot of short files (based on a lot of input files) with the function writing_sub_file, and I finally concatenate all these files after all of the processes have finished, using the function my_concat. Here are two samples of code. Note that this code is in my main .py file, but the function my_concat is imported from another module. The first one:
if __name__ == '__main__':
    pool = Pool(processes=cpu_count())
    arg_tuple = (work_path, article_dict, cat_len, date_to, time_period, val_matrix)
    jobs = [(group, arg_tuple) for group in store_groups]
    pool.apply_async(writing_sub_file, jobs)
    pool.close()
    pool.join()

my_concat(work_path)
which gives many errors (as many as there are processes in the pool), since it tries to apply my_concat before all my processes are done (I don't give the error stack, since it is very clear that the my_concat function tries to run before all the files have been written by the pool processes).
The second one:
if __name__ == '__main__':
    pool = Pool(processes=cpu_count())
    arg_tuple = (work_path, article_dict, cat_len, date_to, time_period, val_matrix)
    jobs = [(group, arg_tuple) for group in store_groups]
    pool.apply_async(writing_sub_file, jobs)
    pool.close()
    pool.join()
    my_concat(work_path)
which works perfectly.
Can someone explain the reason to me?
In the second, my_concat(work_path) is inside the if statement, and is therefore only executed if the script is running as the main script.
In the first, my_concat(work_path) is outside the if statement. When multiprocessing imports the module in a new Python session, it is not imported as __main__ but under its own name. Therefore this statement is run pretty much immediately, in each of your pool's processes, when your module is imported into that process.
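A minimal sketch (a hypothetical stand-alone file, not the asker's code) that makes this visible on Windows: the unguarded print runs again in every spawned worker process, while the guarded print runs only in the parent.

import os
from multiprocessing import Pool

print("module-level code, pid:", os.getpid())    # runs again in every spawned worker

def work(x):
    return x * x

if __name__ == '__main__':
    print("guarded code, pid:", os.getpid())      # runs only in the parent
    with Pool(processes=2) as pool:
        print(pool.map(work, range(4)))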