First of all, let me explain the structure of my scripts:
Script1 calls a function in Script2, and that function in turn calls 12-15 functions in different scripts through multiprocessing. Some of those functions contain infinite loops and some use threading.
The functions that use threading further call functions that contain infinite loops.
The multiprocessing code looks roughly like this:
Script1.py
def func1():
    # some infinite functionality
Script2.py
def func2():
    # some infinite functionality
Script3.py
def func3():
    # thread calling
    # ..
    # ..
# and so on...
and the script that launches the processes:
from multiprocessing import Process
from Script1 import func1
from Script2 import func2
from Script3 import func3
# ... and the imports for func4, func5, func6, and so on

process = []
process.append(Process(target=func1))
process.append(Process(target=func2))
process.append(Process(target=func3))
process.append(Process(target=func4))
process.append(Process(target=func5))
process.append(Process(target=func6))
# ..
# ..
# and so on.

for p in process:
    p.start()
    print("process started:", p)

for p in process:
    p.join()
    print("process joined:", p)
Now, the issues I faced are:
First, only the first "process joined" print statement is printed, even though all the processes started successfully (meaning join never completes for the remaining processes).
Second, some processes become zombies when run under multiprocessing, but the same functions work very well when run externally without multiprocessing (so I know the functions themselves are fine and join is failing for some other reason).
Third, in my situation, should I use multiprocessing.Pool, or will multiprocessing.Process work fine for me? Any suggestions?
Fourth, I want to know whether there are any alternatives to this besides multithreading and multiprocessing. Pools, maybe?
I also tried a different approach, but it did not work:
multiprocessing.set_start_method('spawn')
because, as far as I know, spawn runs every process in a fresh interpreter, much like running scripts externally through Popen.
Note:
I am using Ubuntu 16.04. In my scenario I can use multiprocessing.Process, multiprocessing.Pool, or anything else that works like multiprocessing, but I cannot use multithreading here.
Related
I have a Python module that uses multiprocessing. I'm executing this module from another script with runpy. However, this results in (1) the module running twice, and (2) the multiprocessing jobs never finish (the script just hangs).
In my minimal working example, I have a script runpy_test.py:
import runpy
runpy.run_module('module_test')
and a directory module_test containing an empty __init__.py and a __main__.py:
from multiprocessing import Pool

print 'start'

def f(x):
    return x*x

pool = Pool()
result = pool.map(f, [1,2,3])

print 'done'
When I run runpy_test.py, I get:
start
start
and the script hangs.
If I remove the pool.map call (or if I run __main__.py directly, including the pool.map call), I get:
start
done
I'm running this on Scientific Linux 7.6 in Python 2.7.5.
Rewrite your __main__.py like so:
from multiprocessing import Pool
from .implementation import f
print 'start'
pool = Pool()
result = pool.map(f, [1,2,3])
print 'done'
And then write an implementation.py (you can call this whatever you want) in which your function is defined:
def f(x):
    return x*x
Otherwise you will have the same problem with most interfaces in multiprocessing, and independently of using runpy. As @Weeble explained, when Pool.map tries to load the function f in each sub-process it will import <your_package>.__main__ where your function is defined, but since you have executable code at module-level in __main__ it will be re-executed by the sub-process.
Aside from this technical reason, this is also better design in terms of separation of concerns and testing. Now you can easily import and call (including for test purposes) the function f without running it in parallel.
Try defining your function f in a separate module. It needs to be serialised to be passed to the pool processes, and then those processes need to recreate it, by importing the module it occurs in. However, the __main__.py file it occurs in isn't a module, or at least, not a well-behaved one. Attempting to import it would result in the creation of another Pool and another invocation of map, which seems like a recipe for disaster.
Although not the "right" way to do it, one solution that ended up working for me was to use runpy's _run_module_as_main instead of run_module. This was ideal for me since I was working with someone else's code and required the fewest changes.
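For reference, a minimal sketch of that swap (note that _run_module_as_main is a private, undocumented runpy helper, so it may change between Python versions):

import runpy

# instead of runpy.run_module('module_test'):
runpy._run_module_as_main('module_test')

The difference is that _run_module_as_main executes the module under the name __main__, so an if __name__ == '__main__' guard inside it behaves as it would when the module is launched with python -m.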
Let's say I have three modules:
mod1
mod2
mod3
where each of them runs infinitely long as soon as mod.launch() is called.
What are some elegant ways to launch all these infinite loops at once, without waiting for one to finish before calling the other?
Let's say I'd have a kind of launcher.py, where I'd try to:
import mod1
import mod2
import mod3

if __name__ == "__main__":
    mod1.launch()
    mod2.launch()
    mod3.launch()
This obviously doesn't work, as it will wait for mod1.launch() to finish before launching mod2.launch().
Any kind of help is appreciated.
If you would like to execute multiple functions in parallel, you can use either the multiprocessing library, or concurrent.futures.ProcessPoolExecutor. ProcessPoolExecutor uses multiprocessing internally, but has a simpler interface.
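As a minimal sketch (assuming mod1, mod2 and mod3 each expose a launch() function, as in the question):

from concurrent.futures import ProcessPoolExecutor

import mod1
import mod2
import mod3

if __name__ == "__main__":
    with ProcessPoolExecutor() as executor:
        # schedule each launch() in its own worker process
        futures = [executor.submit(mod.launch) for mod in (mod1, mod2, mod3)]
        # block until they finish (which, for infinite loops, is never)
        for future in futures:
            future.result()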
Depending on the nature of the work being done in each task, the answer varies.
If each task is mostly or all IO-bound, I would recommend multithreading.
If each task is CPU-bound, I would recommend multiprocessing (due to the GIL in Python).
You can also use the threading module to run each module on a separate thread, but within the same process:
import threading

import mod1
import mod2
import mod3

if __name__ == "__main__":
    # make a list of all modules we want to run, for convenience
    mods = [mod1, mod2, mod3]
    # prepare a thread for each module to run the `launch()` method
    threads = [threading.Thread(target=mod.launch) for mod in mods]
    # run all threads
    for thread in threads:
        thread.start()
    # wait for all threads to finish
    for thread in threads:
        thread.join()
The multiprocessing module performs a very similar set of tasks and has a very similar API, but uses separate processes instead of threads, so you can use that too.
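For example, a rough sketch of the same launcher with processes instead of threads (same assumptions about mod1, mod2 and mod3):

import multiprocessing

import mod1
import mod2
import mod3

if __name__ == "__main__":
    mods = [mod1, mod2, mod3]
    # one process per module, each running launch() independently
    processes = [multiprocessing.Process(target=mod.launch) for mod in mods]
    for process in processes:
        process.start()
    for process in processes:
        process.join()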
I'd suggest using Ray, which is a library for parallel and distributed Python. It has some advantages over the standard threading and multiprocessing libraries.
The same code will run on a single machine or on multiple machines.
You can parallelize both functions and classes.
Objects are shared efficiently between tasks using shared memory.
To provide a simple runnable example, I'll use functions and classes instead of modules, but you can always wrap the module in a function or class.
Approach 1: Parallel functions using tasks.
import ray
import time

ray.init()

@ray.remote
def mod1():
    time.sleep(3)

@ray.remote
def mod2():
    time.sleep(3)

@ray.remote
def mod3():
    time.sleep(3)

if __name__ == '__main__':
    # Start the tasks. These will run in parallel.
    result_id1 = mod1.remote()
    result_id2 = mod2.remote()
    result_id3 = mod3.remote()

    # Don't exit the interpreter before the tasks have finished.
    ray.get([result_id1, result_id2, result_id3])
Approach 2: Parallel classes using actors.
import ray
import time

# Don't run this again if you've already run it.
ray.init()

@ray.remote
class Mod1(object):
    def run(self):
        time.sleep(3)

@ray.remote
class Mod2(object):
    def run(self):
        time.sleep(3)

@ray.remote
class Mod3(object):
    def run(self):
        time.sleep(3)

if __name__ == '__main__':
    # Create 3 actors.
    mod1 = Mod1.remote()
    mod2 = Mod2.remote()
    mod3 = Mod3.remote()

    # Start the methods. These will run in parallel.
    result_id1 = mod1.run.remote()
    result_id2 = mod2.run.remote()
    result_id3 = mod3.run.remote()

    # Don't exit the interpreter before the tasks have finished.
    ray.get([result_id1, result_id2, result_id3])
See the Ray documentation for more details.
I have the following code:
class SplunkUKAnalyser(object):
    def __init__ ...
    def method1 ...
    def method2 ...
    def method3 ...
    ...

class SplunkDEAnalyser(SplunkUKAnalyser):
    def __init__ ...   # overridden
    def method1 ...    # overridden
    def method2 ...
    def method3 ...
    ...

def perform_uk_analysis():
    my_uk_analyser = SplunkUKAnalyser()

def perform_de_analysis():
    my_de_analyser = SplunkDEAnalyser()
It all works well if I just execute the below:
perform_uk_analysis()
perform_de_analysis()
How can I make it so that the last two statements are executed concurrently (using multiprocessing and/or multithreading)?
From my test it seems that the second statement starts executing even though the first statement has not finished completely, but I would like to incorporate true concurrency.
Any other additional advice is much appreciated.
Many thanks in advance.
Because of the GIL (Global Interpreter Lock), you cannot achieve 'true concurrency' of CPU-bound work with threading.
However, using multiprocessing to concurrently run multiple tasks is easy:
import multiprocessing
process1 = multiprocessing.Process(target=perform_uk_analysis)
process2 = multiprocessing.Process(target=perform_de_analysis)
# you can optionally daemonize the process
process2.daemon = True
# run the tasks concurrently
process1.start()
process2.start()
# you can optionally wait for a process to finish
process2.join()
For tasks that run the same function with different arguments, consider using multiprocessing.Pool, an even more convenient solution.
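As a rough sketch of that Pool pattern (analyse_country and the country codes here are hypothetical placeholders, not the asker's actual code):

from multiprocessing import Pool

def analyse_country(country_code):
    # hypothetical worker: run the analysis for one country
    print("analysing", country_code)

if __name__ == '__main__':
    with Pool(processes=2) as pool:
        # run the same function with different arguments in parallel
        pool.map(analyse_country, ['uk', 'de'])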
I have a program that, among other things, parses some big files, and I would like to have this done in parallel to save time.
The code flow looks something like this:
if __name__ == '__main__':
    obj = program_object()
    obj.do_so_some_stuff(argv)
    obj.field1 = parse_file_one(f1)
    obj.field2 = parse_file_two(f2)
    obj.do_some_more_stuff()
I tried running the file parsing methods in separate processes like this:
p_1 = multiprocessing.Process(target=parse_file_one, args=(f1,))
p_2 = multiprocessing.Process(target=parse_file_two, args=(f2,))
p_1.start()
p_2.start()
p_1.join()
p_2.join()
There are two problems here. One is how to have the separate processes modify the field, but more importantly, forking the process duplicates my whole main! I get an exception regarding argv when executing
do_so_some_stuff(argv)
a second time. That really is not what I wanted. It even happened when I ran only one of the processes.
How could I get just the file parsing methods to run in parallel to each other, and then continue back with main process like before?
Try putting the parsing methods in a separate module.
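As a rough sketch of that idea (parsers.py, the parse_* bodies, and the file paths are assumptions for illustration, not the asker's actual code):

### parsers.py ###
def parse_file_one(path):
    # ... parse the first file and return whatever field1 should hold
    return "result one"

def parse_file_two(path):
    # ... parse the second file and return whatever field2 should hold
    return "result two"

### main script ###
from multiprocessing import Pool
from parsers import parse_file_one, parse_file_two

if __name__ == '__main__':
    f1, f2 = 'file1.txt', 'file2.txt'  # placeholder paths
    with Pool(processes=2) as pool:
        # run both parsers in parallel and collect their return values
        r1 = pool.apply_async(parse_file_one, (f1,))
        r2 = pool.apply_async(parse_file_two, (f2,))
        field1, field2 = r1.get(), r2.get()

Keeping the parsing functions in their own module, together with the if __name__ == '__main__' guard, prevents the workers from re-executing the main script's top-level code, and the parsed results come back to the parent through the pool's return values.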
First, I guess instead of:
obj = program_object()
program_object.do_so_some_stuff(argv)
you mean:
obj = program_object()
obj.do_so_some_stuff(argv)
Second, try using threading like this:
#!/usr/bin/python
import thread

if __name__ == '__main__':
    try:
        thread.start_new_thread( parse_file_one, (f1,) )
        thread.start_new_thread( parse_file_two, (f2,) )
    except:
        print "Error: unable to start thread"
But, as pointed out by Wooble, depending on the implementation of your parsing functions, this might not be a solution that executes truly in parallel, because of the GIL.
In that case, you should check the Python multiprocessing module that will do true concurrent execution:
multiprocessing is a package that supports spawning processes using an API similar to the threading module. The multiprocessing package offers both local and remote concurrency, effectively side-stepping the Global Interpreter Lock by using subprocesses instead of threads. Due to this, the multiprocessing module allows the programmer to fully leverage multiple processors on a given machine.
When using multiprocessing.Pool in Python with the following code, there is some bizarre behavior.
from multiprocessing import Pool
p = Pool(3)
def f(x): return x
threads = [p.apply_async(f, [i]) for i in range(20)]
for t in threads:
    try: print(t.get(timeout=1))
    except Exception: pass
I get the following error three times (one for each thread in the pool), and it prints "3" through "19":
AttributeError: 'module' object has no attribute 'f'
The first three apply_async calls never return.
Meanwhile, if I try:
from multiprocessing import Pool
p = Pool(3)
def f(x): print(x)
p.map(f, range(20))
I get the AttributeError 3 times, the shell prints "6" through "19", and then hangs and cannot be killed by [Ctrl] + [C]
The multiprocessing docs have the following to say:
Functionality within this package requires that the main module be
importable by the children.
What does this mean?
To clarify, I'm running code in the terminal to test functionality, but ultimately I want to be able to put this into modules of a web server. How do you properly use multiprocessing.Pool in the python terminal and in code modules?
Caveat: Multiprocessing is the wrong tool to use in the context of web servers like Django and Flask. Instead, you should use a task framework like Celery or an infrastructure solution like Elastic Beanstalk Worker Environments. Using multiprocessing to spawn threads or processes is bad because it gives you no oversight or management of those threads/processes, and so you have to build your own failure detection logic, retry logic, etc. At that point, you are better served by using an off-the-shelf tool that is actually designed to handle asynchronous tasks, because it will give you these out of the box.
Understanding the docs
Functionality within this package requires that the main module be importable by the children.
What this means is that pools must be initialized after the definitions of functions to be run on them. Using pools within if __name__ == "__main__": blocks works if you are writing a standalone script, but this isn't possible in either larger code bases or server code (such as a Django or Flask project). So, if you're trying to use Pools in one of these, make sure to follow these guidelines, which are explained in the sections below:
Initialize Pools inside functions whenever possible. If you have to initialize them in the global scope, do so at the bottom of the module (a minimal sketch follows these guidelines).
Do not call the methods of a Pool in the global scope.
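As a minimal sketch of the first guideline (a pool created inside a function, after the worker function has been defined):

from multiprocessing import Pool

def f(x):
    return x * x

def run_in_pool():
    # the pool is created here, inside a function, so the module itself
    # can be imported safely before any workers exist
    with Pool(3) as p:
        return p.map(f, range(20))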
Alternatively, if you only need better parallelism on I/O (like database accesses or network calls), you can save yourself all this headache and use pools of threads instead of pools of processes. This involves the completely undocumented:
from multiprocessing.pool import ThreadPool
Its interface is exactly the same as that of Pool, but since it uses threads and not processes, it comes with none of the caveats that process pools do; the only downside is that you don't get true parallelism of code execution, just parallelism in blocking I/O.
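For instance, a drop-in sketch of the earlier example using ThreadPool instead of Pool:

from multiprocessing.pool import ThreadPool

def f(x): return x

# works even when f is defined in the same interactive session, because
# the workers are threads sharing the parent's memory rather than
# sub-processes that have to re-import the function
p = ThreadPool(3)
print(p.map(f, range(20)))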
Pools must be initialized after the definitions of functions to be run on them
The inscrutable text from the Python docs means that at the time the pool is created, its worker processes capture the state of the surrounding module (on Unix, by forking the parent). In the case of the Python terminal, this means all and only the code you have run so far.
So, any functions you want to use in the pool must be defined before the pool is initialized. This is true both of code in a module and code in the terminal. The following modifications of the code in the question will work fine:
from multiprocessing import Pool
def f(x): return x # FIRST
p = Pool(3) # SECOND
threads = [p.apply_async(f, [i]) for i in range(20)]
for t in threads:
    try: print(t.get(timeout=1))
    except Exception: pass
Or
from multiprocessing import Pool
def f(x): print(x) # FIRST
p = Pool(3) # SECOND
p.map(f, range(20))
By fine, I mean fine on Unix. Windows has its own problems that I'm not going into here.
Using pools in modules
But wait, there's more (to using pools in modules that you want to import elsewhere)!
If you define a pool inside a function, you have no problems. But if you are using a Pool object as a global variable in a module, it must be defined at the bottom of the module, not the top. Though this goes against most good code style, it is necessary for functionality. The way to use a pool declared at the top of a module is to only use it with functions imported from other modules, like so:
from multiprocessing import Pool
from other_module import f
p = Pool(3)
p.map(f, range(20))
Importing a pre-configured pool from another module is pretty horrific, as the import must come after whatever you want to run on it, like so:
### module.py ###
from multiprocessing import Pool
POOL = Pool(5)

### module2.py ###
def f(x):
    # some function body
    return x

from module import POOL
POOL.map(f, range(10))
And second, if you run anything on the pool in the global scope of a module that you are importing, the system hangs. i.e. this doesn't work:
### module.py ###
from multiprocessing import Pool
def f(x): return x
p = Pool(1)
print(p.map(f, range(5)))
### module2.py ###
import module
This, however, does work, as long as nothing imports module2:
### module.py ###
from multiprocessing import Pool
def f(x): return x
p = Pool(1)
def run_pool(): print(p.map(f, range(5)))
### module2.py ###
import module
module.run_pool()
Now, the reasons behind this are even more bizarre, and likely related to why the code in the question raises the AttributeError only once per worker and afterwards appears to execute code properly. It also appears that the pool workers (at least with some reliability) reload the code in module after executing.
The function you want to execute on the pool's workers must already be defined when you create the pool.
This should work:
from multiprocessing import Pool

def f(x): print(x)

if __name__ == '__main__':
    p = Pool(3)
    p.map(f, range(20))
The reason is that (at least on Unix-based systems, which have fork) when you create a pool the workers are created by forking the current process. So if the target function isn't already defined at that point, the worker won't be able to call it.
On Windows it's a bit different, as Windows doesn't have fork. Here new worker processes are started and the main module is imported. That's why on Windows it's important to protect the executing code with an if __name__ == '__main__' guard. Otherwise each new worker will re-execute the code and therefore spawn new processes indefinitely, crashing the program (or the system).
There is another possible source of this error. I got it when running the example code.
The cause was that, despite multiprocessing being installed correctly, the C++ compiler was not installed on my system, something pip informed me of when trying to update multiprocessing. So it might be worth checking that the compiler is installed.