Based on this answer (https://stackoverflow.com/a/20192251/9024698), I have to do this:
from multiprocessing import Pool

def process_image(name):
    sci = fits.open('{}.fits'.format(name))
    <process>

if __name__ == '__main__':
    pool = Pool()                           # Create a multiprocessing Pool
    pool.map(process_image, data_inputs)    # process data_inputs iterable with pool
to multi-process a for loop.
However, I am wondering how I can get the output of this and process it further if I want to. It would have to be something like this:
if __name__ == '__main__':
    pool = Pool()                                    # Create a multiprocessing Pool
    output = pool.map(process_image, data_inputs)    # process data_inputs iterable with pool
    # further processing
But then, does this mean that I have to put all the rest of my code inside the __main__ block, unless I write everything in functions that are called from __main__? The notion of __main__ has always been pretty confusing to me.
if __name__ == '__main__': literally just means "if this file is being run as a script, as opposed to being imported as a module, then do this". __name__ is a module-level variable that the interpreter sets to '__main__' when the file is run as a script. Why it works this way is beyond the scope of this discussion, but suffice it to say that it has to do with how Python evaluates source files top to bottom.
In other words, you can put those other two lines anywhere you want: in a function, most likely, which you then call elsewhere in the program. You could return output from that function, do further processing on it, or whatever else you happen to need.
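As a minimal illustration of how __name__ behaves (the file name whoami.py is just a made-up example):

# whoami.py
print(__name__)

if __name__ == '__main__':
    print("running as a script")

Running python whoami.py prints __main__ followed by "running as a script", whereas import whoami from another file prints only whoami and skips the guarded block.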
My code is:
from some_module import foo, tasks, condition
import multiprocessing

def parallel():
    pool = multiprocessing.Pool()
    pool.map(foo, tasks)
    pool.close()
    pool.join()

if __name__ == '__main__' and condition:
    parallel()
My question is: what is the flow of this code? As I understand it, the first time through, it goes into the if block, then into parallel(), where it creates a pool, and once I call .map it creates new child processes. But what exactly will these child processes do? Will they run the entire script? I know they won't go into the if block, but if that's the case, how the hell do they know what to do? It's only the if block that tells the script to run a function...
I have not been able to implement the suggestion here: Applying two functions to two lists simultaneously.
I guess it is because the module is imported by another module, and so Windows spawns multiple Python processes?
My question is: how can I use the code below without the if __name__ == "__main__": guard?
args_m = [(mortality_men, my_agents, graveyard, families, firms, year, agent)
          for agent in males]
args_f = [(mortality_women, fertility, year, families, my_agents, graveyard, firms, agent)
          for agent in females]

with mp.Pool(processes=(mp.cpu_count() - 1)) as p:
    p.map_async(process_males, args_m)
    p.map_async(process_females, args_f)
Both process_males and process_females are functions, and args_m and args_f are iterables of argument tuples.
Also, I don't need to return anything: the agents are class instances that need updating.
The reason you need to guard multiprocessing code with if __name__ == "__main__" is that you don't want it to run again in the child processes. That can happen on Windows, where the interpreter has to re-import your module to rebuild its state, since there is no fork system call that copies the parent process's address space. But you only need the guard for code that would otherwise run at the top level of the main script, and it's not the only way to guard your code.
In your specific case, I think you should put the multiprocessing code in a function. That won't run in the child process, as long as nothing else calls the function when it should not. Your main module can import the module, then call the function (from within an if __name__ == "__main__" block, probably).
It should be something like this:
some_module.py:
def process_males(x):
    ...

def process_females(x):
    ...

args_m = [...]  # these could be defined inside the function below if that makes more sense
args_f = [...]

def do_stuff():
    with mp.Pool(processes=(mp.cpu_count() - 1)) as p:
        p.map_async(process_males, args_m)
        p.map_async(process_females, args_f)
main.py:
import some_module

if __name__ == "__main__":
    some_module.do_stuff()
In your real code you might want to pass some arguments or get a return value from do_stuff (which should also be given a more descriptive name than the generic one I've used in this example).
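As a rough sketch of what passing arguments and returning values might look like, with made-up worker bodies and example inputs (this is not the asker's real code, just an assumed shape):
some_module.py:

import multiprocessing as mp

def process_males(args):
    # placeholder worker; the real code would update an agent instead
    return len(args)

def process_females(args):
    return len(args)

def do_stuff(args_m, args_f):
    # max(..., 1) guards against single-core machines in this sketch
    with mp.Pool(processes=max(mp.cpu_count() - 1, 1)) as p:
        res_m = p.map_async(process_males, args_m)
        res_f = p.map_async(process_females, args_f)
        # .get() blocks until the corresponding batch has finished
        return res_m.get(), res_f.get()

main.py:

import some_module

if __name__ == "__main__":
    males, females = some_module.do_stuff([(1, 2), (3, 4)], [(5, 6)])
    print(males, females)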
The idea of if __name__ == '__main__': is to avoid infinite process spawning.
When pickling a function defined in your main script, Python has to figure out which part of your main script is the function's code. To do that, the child process will basically re-run (re-import) your script. If the code creating the Pool sits in that same script and is not protected by the "if main" guard, then while trying to import the function the child will try to launch another Pool, which will try to launch another Pool, and so on.
Thus you should separate the function definitions from the actual main script:
from multiprocessing import Pool

# define the test function outside the main block,
# so it can be imported without launching a new Pool
def test_func():
    pass

if __name__ == '__main__':
    with Pool(4) as p:
        r = p.apply_async(test_func)
        # ... do stuff
        result = r.get()
I cannot yet comment on the question, but a workaround I have used, which some have already mentioned, is simply to define the process_males etc. functions in a module different from the one where the processes are spawned, and then import them into the module that does the multiprocessing spawns.
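A minimal sketch of that layout, using made-up file names (workers.py and runner.py) rather than anyone's actual project structure:
workers.py:

# worker functions live in their own module, so importing them
# never triggers any pool creation
def process_males(args):
    return args

def process_females(args):
    return args

runner.py:

import multiprocessing as mp
from workers import process_males, process_females

def run(args_m, args_f):
    with mp.Pool() as p:
        p.map(process_males, args_m)
        p.map(process_females, args_f)

if __name__ == '__main__':
    run([1, 2], [3, 4])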
I solved it by calling the module's multiprocessing function from within the if __name__ == "__main__": block of the main script, since the function that involves multiprocessing is the last step in my module; others could try this if applicable.
I've been a long-time observer of Stack Overflow, but this time I just can't find a solution to my problem, so here I am asking you directly!
Consider this code:
from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    p = Pool(5)
    print(p.map(f, [1, 2, 3]))

print "External"
It's the basic example code for multiprocessing with pools as found here in the first box, plus a print statement at the end.
When executing this in PyCharm Community on Windows 7 with Python 2.7, the Pool part works fine, but "External" is printed multiple times too. As a result, when I try to use multiprocessing on a specific function in another program, all processes end up running the entire program. How do I prevent that, so that only the given function is multiprocessed?
I have tried using Process instead; closing, joining and/or terminating the process or pool; embedding the entire thing in a function; and calling said function from a different file (it then starts executing that file). I can't find anything related to my problem and feel like I'm missing something very simple.
Since the print instruction is not indented (it sits at the top level of the file), it is executed every time the Python file is imported, which is every time a new process is created.
Conversely, all the code placed under if __name__ == '__main__': is not executed each time a process is created, but only in the main process, which is the only place where the condition evaluates to true.
Try the following code and you should not see the issue again; you should expect to see "External" printed to the console only once.
from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    p = Pool(5)
    print(p.map(f, [1, 2, 3]))
    print "External"
Related: python multiprocessing on windows, if __name__ == "__main__"
In Python 2.7, is there a way to identify whether the current forked/spawned process is a child process instance (as opposed to having been started as a regular process)? My goal is to set a global variable differently if it is a child process (e.g. create a pool of size 0 for a child, else a pool of some size greater than 0).
I can't pass a parameter into the function being called to execute in the child process, because even before the function is invoked the process will already have been initialized, and with it the global variable (especially for a spawned process).
Also, I am not in a position to use freeze_support (unless of course I have misunderstood how to use it), as my application runs in a web service container (Flask), so there is no main method.
Any help will be much appreciated.
Sample code that goes into an infinite loop if you run it on Windows:
from multiprocessing import Pool, freeze_support

p = Pool(5)  # This should be created only in the parent process, not in the child processes

def f(x):
    return x*x

if __name__ == '__main__':
    freeze_support()
    print(p.map(f, [1, 2, 3]))
I would suggest restructuring your program to something more like my example code below. You mentioned that you don't have a main function, but you can create a wrapper that handles your pool:
from multiprocessing import Pool, freeze_support

def f(x):
    return x*x

def handle_request():
    p = Pool(5)  # the pool will only exist in the parent process
    print(p.map(f, [1, 2, 3]))
    p.close()    # remember to clean up the resources you use
    p.join()
    return

if __name__ == '__main__':
    freeze_support()  # do you really need this?
    # start your web service here and make it use `handle_request` as the callback
    # when a request needs to be serviced
It sounds like you are having a bit of an XY problem. You shouldn't be making a pool of processes global. It's just bad. You're giving your subprocesses access to their own process objects, which allows you to accidentally do bad things, like make a child process join itself. If you create your pool within a wrapper that is called for each request, then you don't need to worry about a global variable.
In the comments, you mentioned that you want a persistent pool. There is indeed some overhead to creating a pool on each request, but it is far safer than having a global pool. You also gain the ability to handle multiple requests simultaneously, assuming your web service handles each request in its own thread or process, without those requests trampling on each other by trying to use the same pool. I would strongly suggest you try this approach, and if it doesn't meet your performance requirements, you can look at optimizing it in other ways (i.e. without a global pool) to meet your spec.
One other note: multiprocessing.freeze_support() only needs to be called if you intend to bundle your scripts into a Windows executable. Don't use it if you are not doing that.
Move the pool creation into the main section so that the multiprocessing pool is created only once, and only in the main process:
from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    p = Pool(5)
    print(p.map(f, [1, 2, 3]))
This works because the only process that executes under the __main__ name is the original process; spawned child processes re-import the main module under the name __mp_main__.
create a pool with size 0 for child
The child processes should never start a new multiprocessing pool. Only handle your processes from a single entry point.
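If you really do need to detect at runtime whether you are in the original process or in a spawned worker, one possible sketch (not part of the answer above, and the exact re-import name varies between Python versions) is to check the process name that multiprocessing assigns:

import multiprocessing

# At module level: the parent prints '__main__' when run as a script,
# while a spawned worker re-imports the main module under another name
# (e.g. '__mp_main__' on Python 3).
print(__name__)

if multiprocessing.current_process().name == 'MainProcess':
    # Only the original (parent) process is named 'MainProcess';
    # pool workers get names like 'SpawnPoolWorker-1'.
    pass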
I am using multiprocessing to speed up my program, and there is an enigma I cannot solve.
I am using multiprocessing to write a lot of short files (based on a lot of input files) with the function writing_sub_file, and I finally concatenate all these files after the end of all the processes, using the function my_concat. Here are two samples of code. Note that this code is in my main .py file, but the function my_concat is imported from another module. The first one:
if __name__ == '__main__':
    pool = Pool(processes=cpu_count())
    arg_tuple = (work_path, article_dict, cat_len, date_to, time_period, val_matrix)
    jobs = [(group, arg_tuple) for group in store_groups]
    pool.apply_async(writing_sub_file, jobs)
    pool.close()
    pool.join()

my_concat(work_path)
which gives many errors (as many as there are processes in the pool), since it tries to run my_concat before all the processes are done (I don't include the stack trace, since it is very clear that the my_concat function runs before all the files have been written by the pool processes).
The second one:
if __name__ == '__main__':
    pool = Pool(processes=cpu_count())
    arg_tuple = (work_path, article_dict, cat_len, date_to, time_period, val_matrix)
    jobs = [(group, arg_tuple) for group in store_groups]
    pool.apply_async(writing_sub_file, jobs)
    pool.close()
    pool.join()

    my_concat(work_path)
which works perfectly.
Can someone explain the reason to me?
In the second sample, my_concat(work_path) is inside the if block, and is therefore only executed when the script is running as the main script.
In the first sample, my_concat(work_path) is outside the if block. When multiprocessing imports the module in a new Python process, it is not imported as __main__ but under its own name. That statement is therefore executed almost immediately, in each of your pool's processes, as soon as your module is imported into that process.