I want to kick off an infinite loop in a separate process, defined in module A and started from module B. At a later point, I want to terminate that process through module A as well.
The problem I am having is that a boolean flag like keep_running keeps getting reset, even when I tried using global, and I have read that global is best avoided anyway.
A

import multiprocessing
import time

keep_running = True

def stop_it(conf):
    global keep_running
    keep_running = False

def start_it(arg):
    while keep_running:
        do_stuff(arg)
        time.sleep(1)

def main(args):
    ...
    configs.configuartions.Configurations()
    configs.process = multiprocessing.Process(target=start_it, args=(arg,))
    configs.process.start()

if __name__ == '__main__':
    import sys
    main(sys.argv[1:])
B

import A.main as m

def on_init(<args>):
    m.main([conf_file, database])

def on_stop(conf):
    m.stop_it(conf)
What is the best way to accomplish this?
If you are using multiprocessing, you have created separate processes that cannot access each other's variables. In that case you have to communicate through pipes, signals, sockets, or flag files.
To be able to access each other's variables you would use threading instead of multiprocessing. However, you can still kill a different process using os.kill by sending it a SIGTERM or SIGKILL. You need the PID for this; in your example you can get it as configs.process.pid.
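For illustration, here is a minimal sketch of one such cross-process mechanism, a multiprocessing.Event used as a shared stop flag instead of a module-level boolean (this code is not from the original post; the work inside the loop is a stand-in for do_stuff(arg)):

import multiprocessing
import time

def start_it(arg, stop_event):
    # Loop until the parent process sets the event.
    while not stop_event.is_set():
        print("working on", arg)   # stands in for do_stuff(arg)
        time.sleep(1)

if __name__ == '__main__':
    stop_event = multiprocessing.Event()
    process = multiprocessing.Process(target=start_it, args=("some arg", stop_event))
    process.start()
    time.sleep(3)        # ... later, wherever the stop decision is made ...
    stop_event.set()     # ask the worker to leave its loop
    process.join()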
More information is in the official documentation: https://docs.python.org/3/library/multiprocessing.html
Related
I have a big program that takes a while to do its calculations. I decided to learn about multithreading and multiprocessing because only 20% of my processor was being used for the calculation. After seeing no improvement with multithreading, I decided to try multiprocessing, and whenever I try to use it, it just shows a lot of errors, even on very simple code.
This is the code that I tested after starting to have problems with my big, calculation-heavy code:
from concurrent.futures import ProcessPoolExecutor

def func():
    print("done")

def func_():
    print("done")

def main():
    executor = ProcessPoolExecutor(max_workers=3)
    p1 = executor.submit(func)
    p2 = executor.submit(func_)

main()
and in the error message that I am getting it says
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
    if __name__ == '__main__':
        freeze_support()
        ...
This is not the whole message because it is very long, but I think this part should be helpful. Pretty much everything else in the error message is just "error at line ... in ...".
In case it is helpful, the big code is at: https://github.com/nobody48sheldor/fuseeinator2.0
It might not be the latest version.
I updated your code to show main being called. This is an issue with operating systems that spawn processes, like Windows. To test it on my Linux machine I had to add a bit of code, but then this crashes on my machine too:
# Test code to make Linux spawn like Windows and generate the error. This
# code is not needed on Windows.
if __name__ == "__main__":
    import multiprocessing as mp
    mp.freeze_support()
    mp.set_start_method('spawn')

# test script
from concurrent.futures import ProcessPoolExecutor

def func():
    print("done")

def func_():
    print("done")

def main():
    executor = ProcessPoolExecutor(max_workers=3)
    p1 = executor.submit(func)
    p2 = executor.submit(func_)

main()
On a spawning system, Python can't just fork into a new execution context. Instead, it runs a new instance of the Python interpreter, imports the module, and pickles/unpickles enough state to build a child execution environment. This can be a very heavy operation.
But your script is not import-safe. Since main() is called at module level, the import in the child would run main again. That would create a grandchild subprocess which runs main again (and so on, until you hang your machine). Python detects this infinite loop and displays the message instead.
The top-level script is always called "__main__". Put all of the code that should only run once at the script level inside an if. Then, if the module is imported, nothing harmful is run.
if __name__ == "__main__":
main()
and the script will work.
There are code analyzers out there that import modules to extract doc strings, or other useful stuff. Your code shouldn't fire the missiles just because some tool did an import.
Another way to solve the problem is to move everything multiprocessing-related out of the script and into a module. Suppose I had a module with your code in it:
whatever.py

from concurrent.futures import ProcessPoolExecutor

def func():
    print("done")

def func_():
    print("done")

def main():
    executor = ProcessPoolExecutor(max_workers=3)
    p1 = executor.submit(func)
    p2 = executor.submit(func_)
myscript.py

#!/usr/bin/env python3
import whatever

whatever.main()
Now, since the pool is already in an imported module that doesn't do this crazy restart-itself thing, no if __name__ == "__main__": is necessary. It's a good idea to put it in myscript.py anyway, but not required.
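For reference, the guarded variant of myscript.py would simply be:

#!/usr/bin/env python3
import whatever

if __name__ == "__main__":
    whatever.main()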
Let's say I have three modules:
mod1
mod2
mod3
where each of them runs infinitely long as soon as mod.launch() is called.
What are some elegant ways to launch all these infinite loops at once, without waiting for one to finish before calling the other?
Let's say I'd have a kind of launcher.py, where I'd try to:
import mod1
import mod2
import mod3

if __name__ == "__main__":
    mod1.launch()
    mod2.launch()
    mod3.launch()
This obviously doesn't work, as it will wait for mod1.launch() to finish before launching mod2.launch().
Any kind of help is appreciated.
If you would like to execute multiple functions in parallel, you can use either the multiprocessing library, or concurrent.futures.ProcessPoolExecutor. ProcessPoolExecutor uses multiprocessing internally, but has a simpler interface.
Depending on the nature of the work being done in each task, the answer varies.
If each task is mostly or all IO-bound, I would recommend multithreading.
If each task is CPU-bound, I would recommend multiprocessing (due to the GIL in Python).
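As a minimal sketch (assuming each module exposes a blocking launch() function, as in the question), the ProcessPoolExecutor version could look like this:

from concurrent.futures import ProcessPoolExecutor

import mod1
import mod2
import mod3

if __name__ == "__main__":
    mods = [mod1, mod2, mod3]
    with ProcessPoolExecutor(max_workers=len(mods)) as executor:
        futures = [executor.submit(mod.launch) for mod in mods]
    # leaving the with-block waits for all three launch() calls to return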
You can also use the threading module to run each module on a separate thread, but within the same process:
import threading

import mod1
import mod2
import mod3

if __name__ == "__main__":
    # make a list of all modules we want to run, for convenience
    mods = [mod1, mod2, mod3]

    # Prepare a thread for each module to run the `launch()` method
    threads = [threading.Thread(target=mod.launch) for mod in mods]

    # run all threads
    for thread in threads:
        thread.start()

    # wait for all threads to finish
    for thread in threads:
        thread.join()
The multiprocessing module performs a very similar set of tasks and has a very similar API, but uses separate processes instead of threads, so you can use that too; a process-based version of the example above is sketched below.
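For illustration, here is the same structure with multiprocessing.Process swapped in for threading.Thread (a sketch, not part of the original answer):

import multiprocessing

import mod1
import mod2
import mod3

if __name__ == "__main__":
    mods = [mod1, mod2, mod3]

    # One process per module, each running its launch() method
    processes = [multiprocessing.Process(target=mod.launch) for mod in mods]

    for process in processes:
        process.start()

    for process in processes:
        process.join()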
I'd suggest using Ray, which is a library for parallel and distributed Python. It has some advantages over the standard threading and multiprocessing libraries.
The same code will run on a single machine or on multiple machines.
You can parallelize both functions and classes.
Objects are shared efficiently between tasks using shared memory.
To provide a simple runnable example, I'll use functions and classes instead of modules, but you can always wrap the module in a function or class.
Approach 1: Parallel functions using tasks.
import ray
import time

ray.init()

@ray.remote
def mod1():
    time.sleep(3)

@ray.remote
def mod2():
    time.sleep(3)

@ray.remote
def mod3():
    time.sleep(3)

if __name__ == '__main__':
    # Start the tasks. These will run in parallel.
    result_id1 = mod1.remote()
    result_id2 = mod2.remote()
    result_id3 = mod3.remote()

    # Don't exit the interpreter before the tasks have finished.
    ray.get([result_id1, result_id2, result_id3])
Approach 2: Parallel classes using actors.
import ray
import time

# Don't run this again if you've already run it.
ray.init()

@ray.remote
class Mod1(object):
    def run(self):
        time.sleep(3)

@ray.remote
class Mod2(object):
    def run(self):
        time.sleep(3)

@ray.remote
class Mod3(object):
    def run(self):
        time.sleep(3)

if __name__ == '__main__':
    # Create 3 actors.
    mod1 = Mod1.remote()
    mod2 = Mod2.remote()
    mod3 = Mod3.remote()

    # Start the methods, these will run in parallel.
    result_id1 = mod1.run.remote()
    result_id2 = mod2.run.remote()
    result_id3 = mod3.run.remote()

    # Don't exit the interpreter before the tasks have finished.
    ray.get([result_id1, result_id2, result_id3])
You can see the Ray documentation.
I am new to parallelization in general and concurrent.futures in particular. I want to benchmark my script and compare the differences between using threads and processes, but I found that I couldn't even get that running because when using ProcessPoolExecutor I cannot use my global variables.
The following code will output Hello as I expect, but when you change ThreadPoolExecutor to ProcessPoolExecutor, it will output None.
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

greeting = None

def process():
    print(greeting)
    return None

def main():
    with ThreadPoolExecutor(max_workers=1) as executor:
        executor.submit(process)
    return None

def init():
    global greeting
    greeting = 'Hello'
    return None

if __name__ == '__main__':
    init()
    main()
I don't understand why this is the case. In my real program, init is used to set the global variables to CLI arguments, and there are a lot of them. Hence, passing them as arguments does not seem recommended. So how do I pass those global variables to each process/thread correctly?
I know that I can change things around, which will work, but I don't understand why. E.g. the following works for both Executors, but it also means that the globals initialisation has to happen for every instance.
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

greeting = None

def init():
    global greeting
    greeting = 'Hello'
    return None

def main():
    with ThreadPoolExecutor(max_workers=1) as executor:
        executor.submit(process)
    return None

def process():
    init()
    print(greeting)
    return None

if __name__ == '__main__':
    main()
So my main question is: what is actually happening? Why does this code work with threads and not with processes? And how do I correctly pass the globals I set to each process/thread without having to re-initialise them for every instance?
(Side note: because I have read that concurrent.futures might behave differently on Windows, I have to note that I am running Python 3.6 on Windows 10 64 bit.)
I'm not sure of the limitations of this approach, but you can pass (serializable?) objects between your main process/thread. This would also help you get rid of the reliance on global vars:
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def process(opts):
    opts["process"] = "got here"
    print("In process():", opts)
    return None

def main(opts):
    opts["main"] = "got here"
    executor = [ProcessPoolExecutor, ThreadPoolExecutor][1]
    with executor(max_workers=1) as executor:
        executor.submit(process, opts)
    return None

def init(opts):  # Gather CLI opts and populate dict
    opts["init"] = "got here"
    return None

if __name__ == '__main__':
    cli_opts = {"__main__": "got here"}  # Initialize dict
    init(cli_opts)                       # Populate dict
    main(cli_opts)                       # Use dict
Works with both executor types.
Edit: Even though it sounds like it won't be a problem for your use case, I'll point out that with ProcessPoolExecutor, the opts dict you get inside process will be a frozen copy, so mutations to it will not be visible across processes nor will they be visible once you return to the __main__ block. ThreadPoolExecutor, on the other hand, will share the dict object between threads.
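A small demonstration of that difference (this snippet is an illustration, not part of the original answer): the thread pool mutates the shared dict, while the process pool only mutates a pickled copy of it.

from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def process(opts):
    opts["process"] = "got here"

def run(executor_cls):
    opts = {}
    with executor_cls(max_workers=1) as executor:
        executor.submit(process, opts).result()  # wait for the task to finish
    return opts

if __name__ == '__main__':
    print(run(ThreadPoolExecutor))   # {'process': 'got here'} -- the same object was mutated
    print(run(ProcessPoolExecutor))  # {} -- the worker only saw a pickled copy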
Actually, the OP's first snippet will work as intended on Linux (tested with Python 3.6-3.8), because

On Unix a child process can make use of a shared resource created in a
parent process using a global resource.

as explained in the multiprocessing docs. However, for mysterious reasons, it won't work on my Mac running Mojave (which is supposed to be a UNIX-compliant OS; tested only with Python 3.8). And it certainly won't work on Windows; in general it's not a recommended practice with multiple processes.
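As an illustration of that start-method dependence (a sketch added here, not from the original answer; note that since Python 3.8 macOS defaults to 'spawn', which would explain the Mojave behaviour): forcing 'fork' on a Unix system makes the global set in the parent visible to the worker, while 'spawn' does not.

import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor

greeting = None

def process():
    print(greeting)

if __name__ == '__main__':
    mp.set_start_method('fork')   # Unix only; with 'spawn' the worker prints None
    greeting = 'Hello'
    with ProcessPoolExecutor(max_workers=1) as executor:
        executor.submit(process)  # prints 'Hello' because the forked worker inherited the global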
Let's imagine a process is a box, while a thread is a worker inside a box. A worker can only access the resources in its own box and cannot touch the resources in other boxes.
So when you use threads, you are creating multiple workers for your current box (the main process). But when you use processes, you are creating another box. In that case, the global variables initialised in this box are completely different from the ones in the other box. That's why it doesn't work as you expect.
The solution given by jedwards is good enough for most situations. You can explicitly package the resources in the current box (serialize the variables) and deliver them to the other box (transport them to the other process) so that the workers in that box have access to them.
A process represents an activity that runs in a separate process in the OS meaning of the term, while threads all run within your main process. Every process has its own unique namespace.
Your main process sets the value of greeting by calling init() inside the __name__ == '__main__' condition, in its own namespace. In your new process this does not happen (__name__ is '__mp_main__' there), hence greeting remains None and init() is never actually called unless you do so explicitly in the function your process executes.
While sharing state between processes is generally not recommended, there are ways to do so, as outlined in jedwards' answer.
You might also want to check Sharing State Between Processes from the docs.
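One way to pass the globals you set to each worker without re-running init() in every task is the initializer/initargs parameters that ProcessPoolExecutor accepts since Python 3.7 (so not in the Python 3.6 setup mentioned in the question); each worker process runs the initializer exactly once when it starts. A minimal sketch:

from concurrent.futures import ProcessPoolExecutor

greeting = None

def set_globals(value):
    # Runs once per worker process, right after it starts.
    global greeting
    greeting = value

def process():
    print(greeting)

if __name__ == '__main__':
    with ProcessPoolExecutor(max_workers=1,
                             initializer=set_globals,
                             initargs=('Hello',)) as executor:
        executor.submit(process)   # the worker prints 'Hello'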
I'm currently going through some pre-existing code with the goal of speeding it up. There are a few places that are extremely good candidates for parallelization. Since Python has the GIL, I thought I'd use the multiprocessing module.
However, from my understanding the only way this will work on Windows is if I call the function that needs multiple processes from the top-level script under the if __name__ == '__main__' safeguard. This particular program is meant to be distributed and imported as a module, so it would be kind of clunky to have the user copy and paste that safeguard, and that's something I'd really like to avoid.
Am I out of luck or misunderstanding something as far as multiprocessing goes? Or is there any other way to do it with Windows?
For everyone still searching:
inside the module (imported as data below)
from multiprocessing import Process

def printing(a):
    print(a)

def foo(name):
    var = {"process": {}}
    if name == "__main__":
        for i in range(10):
            var["process"][i] = Process(target=printing, args=(str(i),))
            var["process"][i].start()
        for i in range(10):
            var["process"][i].join()
inside main.py
import data
name = __name__
data.foo(name)
output:
>>2
>>6
>>0
>>4
>>8
>>3
>>1
>>9
>>5
>>7
I am a complete noob so please don't judge the coding OR presentation but at least it works.
As explained in comments, perhaps you could do something like
# client_main.py
from mylib.mpSentinel import MPSentinel

# client logic

if __name__ == "__main__":
    MPSentinel.As_master()

# mpsentinel.py
class MPSentinel(object):
    _is_master = False

    @classmethod
    def As_master(cls):
        cls._is_master = True

    @classmethod
    def Is_master(cls):
        return cls._is_master
It's not ideal in that it's effectively a singleton/global, but it would work around Windows' lack of fork. Still, you could use MPSentinel.Is_master() to use multiprocessing optionally, and it should prevent Windows from process-bombing.
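A sketch of how the library side could consult the sentinel (the run_parallel_or_serial function and its tasks argument are made up for illustration):

# somewhere inside mylib
from multiprocessing import Process
from mylib.mpSentinel import MPSentinel

def run_parallel_or_serial(tasks):
    if MPSentinel.Is_master():
        # The importing script declared itself master, so spawning is safe.
        processes = [Process(target=task) for task in tasks]
        for p in processes:
            p.start()
        for p in processes:
            p.join()
    else:
        # Imported without the guard having run: fall back to plain calls.
        for task in tasks:
            task()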
On ms-windows, you should be able to import the main module of a program without side effects like starting a process.
When Python imports a module, it actually runs it.
So one way of doing that is inside the if __name__ == '__main__' block.
Another way is to do it from within a function.
The following won't work on ms-windows:
from multiprocessing import Process

def foo():
    print('hello')

p = Process(target=foo)
p.start()
This is because it tries to start a process when importing the module.
The following example from the programming guidelines is OK:
from multiprocessing import Process, freeze_support, set_start_method

def foo():
    print('hello')

if __name__ == '__main__':
    freeze_support()
    set_start_method('spawn')
    p = Process(target=foo)
    p.start()
Because the code in the if block doesn't run when the module is imported.
But putting it in a function should also work:
from multiprocessing import Process

def foo():
    print('hello')

def bar():
    p = Process(target=foo)
    p.start()
When this module is imported (or run), it will just define the two functions, not run them.
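A short usage sketch (the module name safe_procs is made up for illustration): importing the module defines the functions without side effects, and the caller starts the process only when it decides to.

# main.py
import safe_procs            # just defines foo() and bar(), starts nothing

if __name__ == '__main__':
    safe_procs.bar()         # only now is the child process started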
I've been developing an Instagram image scraper, so in order to make the download & save operations run faster I've implemented multiprocessing in an auxiliary module. Note that this code lives in an auxiliary module and not in the main module.
The solution I found is adding this line:
if __name__ != '__main__':
Pretty simple, but it actually works!
import multiprocessing
import requests

def multi_proces(urls, profile):
    img_saved = 0
    if __name__ != '__main__':  # line needed for the sake of getting this NOT to crash
        processes = []
        for url in urls:
            try:
                process = multiprocessing.Process(target=download_save, args=[url, profile, img_saved])
                processes.append(process)
                img_saved += 1
            except:
                continue
        for proce in processes:
            proce.start()
        for proce in processes:
            proce.join()
    return img_saved

def download_save(url, profile, img_saved):
    file = requests.get(url, allow_redirects=True)  # Download
    open(f"scraped_data\{profile}\{profile}-{img_saved}.jpg", 'wb').write(file.content)  # Save
I have a program that, among other things, parses some big files, and I would like to have this done in parallel to save time.
The code flow looks something like this:
if __name__ == '__main__':
    obj = program_object()
    obj.do_so_some_stuff(argv)

    obj.field1 = parse_file_one(f1)
    obj.field2 = parse_file_two(f2)

    obj.do_some_more_stuff()
I tried running the file parsing methods in separate processes like this:
p_1 = multiprocessing.Process(target=parse_file_one, args=(f1))
p_2 = multiprocessing.Process(target=parse_file_two, args=(f2))
p_1.start()
p_2.start()
p_1.join()
p_2.join()
There are two problems here. One is how to have the separate processes modify the field, but more importantly, forking the process duplicates my whole main! I get an exception regarding argv when executing
do_so_some_stuff(argv)
a second time. That really is not what I wanted. It even happened when I ran only one of the Processes.
How could I get just the file parsing methods to run in parallel to each other, and then continue with the main process like before?
Try putting the parsing methods in a separate module.
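A minimal sketch of that idea (the module name parsers, the trivial parse bodies, and the use of multiprocessing.Pool are illustrative assumptions, not part of the original answer): the parse functions live in their own importable module, and the guarded main collects their return values, so the child processes never need to modify obj directly.

# parsers.py -- importable, no side effects
def parse_file_one(path):
    with open(path) as f:
        return f.read()          # stand-in for the real parsing

def parse_file_two(path):
    with open(path) as f:
        return len(f.read())     # stand-in for the real parsing

# main script
import multiprocessing
import parsers

if __name__ == '__main__':
    obj = program_object()
    obj.do_so_some_stuff(argv)
    with multiprocessing.Pool(processes=2) as pool:
        res1 = pool.apply_async(parsers.parse_file_one, (f1,))
        res2 = pool.apply_async(parsers.parse_file_two, (f2,))
        obj.field1 = res1.get()   # results come back to the parent process
        obj.field2 = res2.get()
    obj.do_some_more_stuff()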
First, I guess instead of:
obj = program_object()
program_object.do_so_some_stuff(argv)
you mean:
obj = program_object()
obj.do_so_some_stuff(argv)
Second, try using threading like this:
#!/usr/bin/python
import thread

if __name__ == '__main__':
    try:
        thread.start_new_thread( parse_file_one, (f1,) )
        thread.start_new_thread( parse_file_two, (f2,) )
    except:
        print "Error: unable to start thread"
But, as pointed out by Wooble, depending on the implementation of your parsing functions, this might not be a solution that executes truly in parallel, because of the GIL.
In that case, you should check the Python multiprocessing module that will do true concurrent execution:
multiprocessing is a package that supports spawning processes using an
API similar to the threading module. The multiprocessing package
offers both local and remote concurrency, effectively side-stepping
the Global Interpreter Lock by using subprocesses instead of threads.
Due to this, the multiprocessing module allows the programmer to fully
leverage multiple processors on a given machine.