My code is
from some_module import foo, tasks, condition
import multiprocessing

def parallel():
    pool = multiprocessing.Pool()
    pool.map(foo, tasks)
    pool.close()
    pool.join()

if __name__ == '__main__' and condition:
    parallel()
My question is, what is the flow of this code? As I understand it, first time, it goes into the if-block, then goes into parallel, where it creates a pool, and once I call .map, it creates new child processes, and these child processes will … do what, exactly? Will they run the entire script? I know that means it won't go into the if-block, but if that's the case, how the hell does it know what to do? It's only the if-block which tells it to run a function...
Related
I'm trying to run a bunch of processes in parallel with ProcessPoolExecutor from concurrent.futures in Python.
The processes are all running in parallel in a while loop, which is great, but for some reason the code outside of the main method runs repeatedly. I saw another answer that said to use the if __name__ == "__main__" check to fix this, but it still doesn't work.
Any ideas how I can just get the code inside the main method to run? My object keeps getting reset repeatedly.
EDIT: I ran my code using ThreadPoolExecutor instead and it fixed the problem, although I'm still curious about this.
import concurrent.futures
import time
from myFile import myObject

obj = myObject()

def main():
    with concurrent.futures.ProcessPoolExecutor() as executor:
        while condition:
            for index in range(0,10):
                executor.submit(obj.function, index, index+1)
            executor.submit(obj.function2)
            time.sleep(5)

print("test")

if __name__ == "__main__":
    main()
I am using multiprocessing python module to run parallel and unrelated jobs with a function similar to the following example:
import numpy as np
from multiprocessing import Pool

def myFunction(arg1):
    name = "file_%s.npy"%arg1
    A = np.load(arg1)
    A[A<0] = np.nan
    np.save(arg1,A)

if(__name__ == "__main__"):
    N = list(range(50))
    with Pool(4) as p:
        p.map_async(myFunction, N)
        p.close() # I tried with and without that statement
        p.join() # I tried with and without that statement

    DoOtherStuff()
My problem is that the function DoOtherStuff is never executed. The process shows up as sleeping in top, and I need to kill it with Ctrl+C to stop it.
Any suggestions?
You have at least a couple of problems. First, you are using map_async(), which does not block until the tasks are completed. So what you're doing is starting the tasks with map_async(), but then immediately closing and terminating the pool (the with statement calls Pool.terminate() upon exiting).
When you add tasks to a process pool with methods like map_async(), they go onto a task queue. A worker thread takes tasks off that queue and farms them out to worker processes, and new processes are spawned as needed (actually, there is a separate thread which handles that).
Point being, you have a race condition where you're terminating the Pool likely before any tasks are even started. If you want your script to block until all the tasks are done just use map() instead of map_async(). For example, I rewrote your script like this:
import numpy as np
from multiprocessing import Pool

def myFunction(N):
    A = np.load(f'file_{N:02}.npy')
    A[A<0] = np.nan
    np.save(f'file2_{N:02}.npy', A)

def DoOtherStuff():
    print('done')

if __name__ == "__main__":
    N = range(50)
    with Pool(4) as p:
        p.map(myFunction, N)

    DoOtherStuff()
I don't know what your use case is exactly, but if you do want to use map_async(), so that this task can run in the background while you do other stuff, you have to leave the Pool open, and manage the AsyncResult object returned by map_async():
result = pool.map_async(myFunction, N)
DoOtherStuff()
# Is my map done yet? If not, we should still block until
# it finishes before ending the process
result.wait()
pool.close()
pool.join()
You can see more examples in the linked documentation.
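For completeness, here is a self-contained sketch of the map_async() pattern described above. The square worker and do_other_stuff are placeholders I made up to stand in for the NumPy file processing; the Pool and AsyncResult calls are the standard multiprocessing ones. Treat it as an illustration under those assumptions, not as the one right way to do it.

```python
from multiprocessing import Pool


def square(n):
    # Hypothetical stand-in for the real per-file work.
    return n * n


def do_other_stuff():
    # Placeholder for work the parent does while the pool is busy.
    return "other stuff done"


def run_in_background(values, workers=4):
    pool = Pool(workers)
    try:
        # map_async() returns immediately with an AsyncResult.
        result = pool.map_async(square, values)
        side = do_other_stuff()
        # get() blocks until every task has finished and returns the
        # results in input order (re-raising any worker exception).
        squares = result.get()
    finally:
        # Only shut the pool down after the results have been collected.
        pool.close()
        pool.join()
    return squares, side


if __name__ == "__main__":
    print(run_in_background(range(5)))
```

Using get() instead of wait() is a small design choice here: both block until the tasks are done, but get() also hands back the return values.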
I don't know why you got a deadlock in your attempt; I was not able to reproduce that. It's possible there was a bug at some point that was then fixed, though you were also possibly invoking undefined behavior with your race condition, as well as calling terminate() on a pool after it had already been join()ed. As for why your answer did anything at all, it's possible that with the multiple calls to apply_async() you managed to skirt around the race condition somewhat, but this is not at all guaranteed to work.
Based on this answer (https://stackoverflow.com/a/20192251/9024698), I have to do this:
from multiprocessing import Pool

def process_image(name):
    sci=fits.open('{}.fits'.format(name))
    <process>

if __name__ == '__main__':
    pool = Pool() # Create a multiprocessing Pool
    pool.map(process_image, data_inputs) # process data_inputs iterable with pool
to multi-process a for loop.
However, I am wondering, how can I get the output of this and further process if I want?
It must be like that:
if __name__ == '__main__':
    pool = Pool() # Create a multiprocessing Pool
    output = pool.map(process_image, data_inputs) # process data_inputs iterable with pool
    # further processing
But then this means that I have to put all the rest of my code in __main__ unless I write everything in functions which are called by __main__?
The notion of __main__ has been always pretty confusing to me.
if __name__ == '__main__': is literally just "if this file is being run as a script, as opposed to being imported as a module, then do this". __name__ is a module-level variable that Python sets to '__main__' when the file is run as a script, and to the module's name when it is imported. Why it works this way is beyond the scope of this discussion, but suffice it to say it has to do with how Python evaluates source files top-to-bottom.
In other words, you can put the other two lines anywhere you want - in a function, probably, that you call elsewhere in the program. You could return output from that function, or do other processing on it, or etc., whatever you happen to need.
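A minimal sketch of what that could look like. The process_image body here is a made-up placeholder that just returns the length of the name (the real one would open the FITS file), and process_all, summarize and the input names are hypothetical names chosen for illustration:

```python
from multiprocessing import Pool


def process_image(name):
    # Hypothetical placeholder: the real function would open
    # '{}.fits'.format(name) and return whatever you need from it.
    return len(name)


def process_all(names):
    # The pool lives inside a function, so nothing runs at import time.
    with Pool() as pool:
        # map() blocks until all results are ready, in input order.
        return pool.map(process_image, names)


def summarize(output):
    # Further processing happens in ordinary code, wherever you like.
    return sum(output)


if __name__ == "__main__":
    data_inputs = ["img_a", "img_bb", "img_ccc"]  # made-up names
    output = process_all(data_inputs)
    print(summarize(output))
```

Only the two lines that kick things off need to sit under the if; everything else is plain functions that can live anywhere, including in other modules.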
In Python 2.7, is there a way to identify whether the current forked/spawned process is a child process instance (as opposed to having been started as a regular process)? My goal is to set a global variable differently if it's a child process (e.g. create a pool with size 0 for the child, else a pool with some number greater than 0).
I can't pass a parameter into the function (the one being called to execute in the child process), because the process has already been initialized, and hence the global variable set, before the function is invoked (especially for a spawned process).
Also, I am not in a position to use freeze_support (unless of course I have misunderstood how to use it), as my application is running in a web service container (Flask). Hence there's no main method.
Any help will be much appreciated.
Sample code that goes into an infinite loop if you run it on Windows:
from multiprocessing import Pool, freeze_support

p = Pool(5) # This should be created only in parent process and not the child process

def f(x):
    return x*x

if __name__ == '__main__':
    freeze_support()
    print(p.map(f, [1, 2, 3]))
I would suggest restructuring your program to something more like my example code below. You mentioned that you don't have a main function, but you can create a wrapper that handles your pool:
from multiprocessing import Pool, freeze_support

def f(x):
    return x*x

def handle_request():
    p = Pool(5) # pool will only be in the parent process
    print(p.map(f, [1, 2, 3]))
    p.close() # remember to clean up the resources you use
    p.join()
    return

if __name__ == '__main__':
    freeze_support() # do you really need this?
    # start your web service here and make it use `handle_request` as the callback
    # when a request needs to be serviced
It sounds like you are having a bit of an XY problem. You shouldn't be making a pool of processes global. It's just bad. You're giving your subprocesses access to their own process objects, which allows you to accidentally do bad things, like make a child process join itself. If you create your pool within a wrapper that is called for each request, then you don't need to worry about a global variable.
In the comments, you mentioned that you want a persistent pool. There is indeed some overhead to creating a pool on each request, but it's far safer than having a global pool. Also, you now have the capability to handle multiple requests simultaneously, assuming your web service handles each request in its own thread/process, without multiple requests trampling on each other by trying to use the same pool. I would strongly suggest you try this approach, and if it doesn't meet your performance specifications, you can look at optimizing it in other ways (i.e., still without a global pool) to meet your spec.
One other note: multiprocessing.freeze_support() only needs to be called if you intend to bundle your scripts into a Windows executable. Don't use it if you are not doing that.
Move the pool creation into the main section so that the multiprocessing pool is created only once, and only in the main process:
from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    p = Pool(5)
    print(p.map(f, [1, 2, 3]))
This works because the only process that is executing in the __main__ name is the original process. Spawned processes run with the __mp_main__ module name.
"create a pool with size 0 for child"
The child processes should never start a new multiprocessing pool. Only handle your processes from a single entry point.
I am using multiprocessing to speed up my program and there is an enigma I can not solve.
I am using multiprocessing to write a lot of short files (based on a lot of input files) with the function writing_sub_file, and I finally concatenate all these files after the end of all the processes, using the function my_concat. Here are two samples of code. Note that this code is in my main .py file, but the function my_concat is imported from another module. The first one:
if __name__ == '__main__':
    pool = Pool(processes=cpu_count())
    arg_tuple = (work_path, article_dict, cat_len, date_to, time_period, val_matrix)
    jobs = [(group, arg_tuple) for group in store_groups]
    pool.apply_async(writing_sub_file, jobs)
    pool.close()
    pool.join()
my_concat(work_path)
which gives many errors (as many as there are processes in the pool), since it tries to run my_concat before all my processes are done. (I am not posting the stack trace, since it is very clear that my_concat is being called before all the files have been written by the pool processes.)
The second one:
if __name__ == '__main__':
    pool = Pool(processes=cpu_count())
    arg_tuple = (work_path, article_dict, cat_len, date_to, time_period, val_matrix)
    jobs = [(group, arg_tuple) for group in store_groups]
    pool.apply_async(writing_sub_file, jobs)
    pool.close()
    pool.join()
    my_concat(work_path)
which works perfectly.
Can someone explain the reason to me?
In the second, my_concat(work_path) is inside the if statement, and is therefore only executed if the script is running as the main script.
In the first, my_concat(work_path) is outside the if statement. When multiprocessing imports the module in a new Python session, it is not imported as __main__ (with the spawn start method it runs under the name __mp_main__). Therefore this statement is run pretty much immediately, in each of your pool's processes, when your module is imported into that process.