Post-processing results after multi-processing in Python

So I have a simple MP code and it works like a charm. However, when I do a very simple post processing on the data generated via MP, the code does not work anymore. It never stops and runs like forever! This is the code (and again it works perfectly):
import numpy as np
from multiprocessing import Pool
n = 4
nMCS = 10**5
def my_function(j):
    result = []
    for j in range(nMCS // n):
        a = np.random.rand(10,2)
        result.append(a)
    return result
if __name__ == '__main__':
    __spec__ = "ModuleSpec(name='builtins', loader=<class '_frozen_importlib.BuiltinImporter'>)" # this is because I am using Spyder!
    pool = Pool(processes = n)
    data = pool.map(my_function, [i for i in range(n)])
    pool.close()
    pool.join()
#final_result = np.concatenate(data) ### this is what ruins my code! ###
Meanwhile, if I add final_result = np.concatenate(data) at the end, it never works! I am using Spyder and if I simply type final_result = np.concatenate(data) in the console AFTER MP is done, it gives me what I want i.e. a concatenated list. However, if I put that simple line in the main program at the very end, it just doesn't work. Could anyone tell me how to fix this?
P.S. this is a very simple example I generated so you can understand what is going on; my real problem is way more complicated and there is no way I can do post processing after I am done with MP.

As @Ares already implied, you fix the problem by indenting everything below the if __name__ == "__main__": statement into the if-block.
FYI, this happens on Windows, which doesn't provide forking for starting new processes like Unix-y systems do, but uses 'spawn' as its default (and only) start method. Spawn means the OS has to boot a new process with a fresh interpreter from scratch for every worker process.
Your worker-processes will need to import your target function my_function. When this happens, everything not protected within the if __name__ == "__main__":-block will also run in every child-process on import.
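A minimal sketch of that fix, assuming the standard library's random module as a dependency-free stand-in for NumPy. The names simulate_chunk, combine and main are hypothetical, not from the question. The post-processing step is only reached through the guard, so a spawned worker that re-imports the file sees nothing but definitions.

```python
import random
from multiprocessing import Pool

N_WORKERS = 4
N_SAMPLES = 1000  # kept small for the sketch; the question uses 10**5


def simulate_chunk(n_samples):
    """Worker: return n_samples random pairs (stand-in for np.random.rand)."""
    return [(random.random(), random.random()) for _ in range(n_samples)]


def combine(chunks):
    """Post-processing step: flatten the per-worker lists into one list."""
    return [row for chunk in chunks for row in chunk]


def main():
    chunk_sizes = [N_SAMPLES // N_WORKERS] * N_WORKERS
    with Pool(processes=N_WORKERS) as pool:
        chunks = pool.map(simulate_chunk, chunk_sizes)
    # Only reachable through the guard below, so workers never run it.
    return combine(chunks)


if __name__ == "__main__":
    final_result = main()
    print(len(final_result))
```

With NumPy arrays, the combine step would presumably be the np.concatenate call from the question, placed at the same point.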

Your problem is that np.concatenate is not called inside the if __name__ == '__main__': block. I suspect that the problem you're encountering is Spyder-specific, but updating the indentation should fix it.

Related

Trouble writing python output to file with multiprocessing pool

I have found some similar questions to mine on this forum but none that has helped with my specific problem (I'm fairly new to python so forgive me if I should've understood the answer from a different thread). I am actually working with a fairly complicated python code, but to illustrate the problem I'm having I have written a very minimal code that exhibits the same behavior:
from multiprocessing import Pool
log_output = open('log','w')
def sum_square(number):
    s = 0
    for i in range(number):
        s += i * i
    log_output.write(str(s)+'\n')
    return s
if __name__ == "__main__":
    numbers = range(5)
    p = Pool()
    result = p.map(sum_square,numbers)
    p.close()
    p.join()
When this code is executed, there are no errors or warnings, etc., but the resulting file "log" is always totally empty. This is the same behavior I find in my other more complicated python program also using a multiprocessing pool. I had expected that the processes would probably print to file in the wrong order, but I'm confused why nothing prints at all. If I get rid of the multiprocessing and just run the same basic code with a single process, it prints to file just fine. Can anyone help me understand what I'm doing wrong here? Thanks.
I think you're well into undefined territory here, but just flushing the file works for me. For example:
from multiprocessing import Pool
log_output = open('log', 'w')
def sum_square(number):
    print(number, file=log_output)
    log_output.flush()
if __name__ == "__main__":
    with Pool() as p:
        p.map(sum_square, range(5))
This works for me in Python 3.8 under Linux. I've also shortened your code to clarify the issue and make it slightly more idiomatic.

Multiprocessing only using a single thread instead of multiple

This question has been asked and solved a few times recently but I have quite a specific example...
I have a multiprocessing function that was working absolutely fine in complete isolation yesterday (in an interactive notebook), however, I decided to parameterise so I can call it as part of a larger pipeline & for abstraction/cleaner notebook and now it's only using a single thread instead of 6.
import pandas as pd
import multiprocessing as mp
from multiprocessing import get_context
mp.set_start_method('forkserver')
def multiprocess_function(func, iterator, input_data):
    result_list = []
    def append_result(result):
        result_list.append(result)
    with get_context('fork').Pool(processes=6) as pool:
        for i in iterator:
            pool.apply_async(func, args = (i, input_data), callback = append_result)
        pool.close()
        pool.join()
    return result_list
multiprocess_function(count_live, run_weeks, base_df)
My previous version of the code executed differently, instead of a return / call I was using the following at the bottom of the function (which doesn't work at all now I've parameterised - even with the args assigned)
if __name__ == '__main__':
    multiprocess_function()
The function executes fine, just only operates across one thread as per the output in top.
Apologies if this is something incredibly simple - I'm not a programmer, I'm an analyst :)
edit: everything works absolutely fine if I include the if __name__ == '__main__': etc at the bottom of the function and execute the cell, however, when I do this I have to remove the parameters - maybe just something to do with scoping. If I execute by calling the function, whether it is parameterised or not, it only operates on a single thread.
You've got two problems:
1. You're not using an import guard.
2. You're not setting the default start method inside the import guard.
Between the two of them, you end up telling Python to spawn the forkserver inside the forkserver, which can only cause you grief. Change the structure of your code to:
import pandas as pd
import multiprocessing as mp
from multiprocessing import get_context
def multiprocess_function(func, iterator, input_data):
    result_list = []
    with get_context('fork').Pool(processes=6) as pool:
        for i in iterator:
            pool.apply_async(func, args=(i, input_data), callback=result_list.append)
        pool.close()
        pool.join()
    return result_list
if __name__ == '__main__':
    mp.set_start_method('forkserver')
    multiprocess_function(count_live, run_weeks, base_df)
Since you didn't show where you got count_live, run_weeks and base_df from, I'll just say that for the code as written, they should be defined in the guarded section (since nothing relies on them as a global).
There are other improvements to be made (apply_async is being used in a way that makes me think you really just wanted to listify the result of pool.imap_unordered, without the explicit loop), but that's fixing the big issues that will wreck use of spawn or forkserver start methods.
Using get_context('spawn') instead of get_context('fork') may solve your problem.

Python multiprocessing, using pool multiple times in a loop gets stuck after first iteration

I have the following situation, where I create a pool in a for loop as follows (I know it's not very elegant, but I have to do this for pickling reasons). Assume that the pathos.multiprocessing is equivalent to python's multiprocessing library (as it is up to some details, that are not relevant for this problem).
I have the following code I want to execute:
self.pool = pathos.multiprocessing.ProcessingPool(number_processes)
for i in range(5):
    all_responses = self.pool.map(wrapper_singlerun, range(self.no_of_restarts))
    pool._clear()
Now my problem: The loop successfully runs the first iteration. However, at the second iteration, the algorithm suddenly stops (it does not finish the pool.map operation). I suspected that zombie processes are generated, or that the process was somehow switched. Below you will find everything I have tried so far.
for i in range(5):
    pool = pathos.multiprocessing.ProcessingPool(number_processes)
    all_responses = self.pool.map(wrapper_singlerun, range(self.no_of_restarts))
    pool._clear()
    gc.collect()
    for p in multiprocessing.active_children():
        p.terminate()
        gc.collect()
    print("We have so many active children: ", multiprocessing.active_children()) # Returns []
The above code works perfectly well on my mac. However, when I upload it on the cluster that has the following specs, I get the error that it gets stuck after the first iteration:
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04 LTS"
This is the link to the pathos' multiprocessing library file is
I am assuming that you are calling this from inside some function, which is not the correct way to use it.
You need to wrap it in:
if __name__ == '__main__':
    for i in range(5):
        pool = pathos.multiprocessing.Pool(number_processes)
        all_responses = pool.map(wrapper_singlerun,
                                 range(self.no_of_restarts))
If you don't, each new process will re-run the module and create pools of its own, which keep piling up until everything blocks. The reason it works on a Mac is that macOS has fork, while Windows does not.
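A sketch of the guarded loop, assuming the standard library's multiprocessing.Pool as a stand-in for the pathos pool (the question treats the two as equivalent for this purpose). The body of wrapper_singlerun and the run_rounds helper are hypothetical. Each round gets a fresh pool, and the with block shuts it down before the next round starts.

```python
from multiprocessing import Pool


def wrapper_singlerun(restart_index):
    """Hypothetical stand-in for the asker's worker."""
    return restart_index * restart_index


def run_rounds(n_rounds, n_restarts, n_processes):
    """Create one pool per round and tear it down before the next round."""
    all_rounds = []
    for _ in range(n_rounds):
        with Pool(processes=n_processes) as pool:
            all_rounds.append(pool.map(wrapper_singlerun, range(n_restarts)))
    return all_rounds


if __name__ == "__main__":
    print(run_rounds(n_rounds=5, n_restarts=3, n_processes=2))
```

Leaving the with block calls the pool's terminate(), which is safe here because pool.map has already returned all results.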

multiprocessing.Pool in jupyter notebook works on linux but not windows

I'm trying to run a few independent computations (though reading from the same data). My code works when I run it on Ubuntu, but not on Windows (windows server 2012 R2), where I get the error:
'module' object has no attribute ...
when I try to use multiprocessing.Pool (it appears in the kernel console, not as output in the notebook itself)
(And I've already made the mistake of defining the function AFTER creating the pool, and I've also corrected it, that's not the problem).
This happens even on the simplest of examples:
from multiprocessing import Pool
def f(x):
    return x**2
pool = Pool(4)
for res in pool.map(f,range(20)):
    print res
I know that it needs to be able to import the module (and I have no idea how this works when working in the notebook), and I've heard of IPython.Parallel, but I have been unable to find any documentation or examples.
Any solutions/alternatives would be most welcome.
I would post this as a comment since I don't have a full answer, but I'll amend as I figure out what is going on.
from multiprocessing import Pool
def f(x):
    return x**2
if __name__ == '__main__':
    pool = Pool(4)
    for res in pool.map(f,range(20)):
        print(res)
This works. I believe the answer to this question is here. In short, the subprocesses do not know they are subprocesses and are attempting to run the main script recursively.
This is the error I am given, which gives us the same solution:
RuntimeError:
    An attempt has been made to start a new process before the
    current process has finished its bootstrapping phase.
    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:
        if __name__ == '__main__':
            freeze_support()
            ...
    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.
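On the question's remark that the workers must be able to import the function: one option that may help in a notebook is to keep the worker in its own .py file. The sketch below assumes a hypothetical module name square_worker and writes it to a temporary directory purely to stay self-contained; in practice the file would more likely sit next to the notebook.

```python
import importlib
import sys
import tempfile
from multiprocessing import Pool
from pathlib import Path

# Source of the worker module; normally this would simply be a saved file.
WORKER_SOURCE = "def f(x):\n    return x ** 2\n"


def load_worker_module(directory, name="square_worker"):
    """Write the worker to <directory>/<name>.py and import it by name."""
    Path(directory, f"{name}.py").write_text(WORKER_SOURCE)
    if directory not in sys.path:
        sys.path.insert(0, directory)
    importlib.invalidate_caches()
    return importlib.import_module(name)


if __name__ == "__main__":
    worker = load_worker_module(tempfile.mkdtemp())
    with Pool(4) as pool:
        print(pool.map(worker.f, range(20)))
```

The point of the design is that a spawned child can find f by importing square_worker, instead of looking for it in a __main__ module it cannot re-create.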

Python multiprocessing hanging on pool.join()

I'm having problems with Python freezing when I try to use the multiprocessing module. I'm using Spyder 2.3.2 with Python 3.4.3 (I have previously encountered problems that were specific to iPython).
I've reduced it to the following MWE:
import multiprocessing
def test_function(arg1=1,arg2=2):
    print("arg1 = {0}, arg2 = {1}".format(arg1,arg2))
    return None
pool = multiprocessing.Pool(processes=3)
for i in range(6):
    pool.apply_async(test_function)
pool.close()
pool.join()
This, in its current form, should just produce six identical iterations of test_function. However, while I can enter the commands with no hassle, when I give the command pool.join(), iPython hangs, and I have to restart the kernel.
The function works perfectly well when done in serial (the next step in my MWE would be to use pool.apply_async(test_function,kwds=entry)):
for i in range(6):
    test_function()
arg_list = [{'arg1':3,'arg2':4},{'arg1':5,'arg2':6},{'arg1':7,'arg2':8}]
for entry in arg_list:
    test_function(**entry)
I have (occasionally, and I'm unable to reliably reproduce it) come across an error message of ZMQError: Address already in use, which led me to this bug report, but preceding my code with either multiprocessing.set_start_method('spawn') or multiprocessing.set_start_method('forkserver') doesn't seem to work.
Can anyone offer any help/advice? Thanks in advance if so.
@Anarkopsykotik is correct: you must use a main guard, and you can get it to print by returning a result to the main process.
Here's a working example.
import multiprocessing
import os
def test_function(arg1=1,arg2=2):
    string="arg1 = {0}, arg2 = {1}".format(arg1,arg2) +" from process id: "+ str(os.getpid())
    return string
if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=3)
    for i in range(6):
        result = pool.apply_async(test_function)
        print(result.get(timeout=1))
    pool.close()
    pool.join()
Two things come to mind that might cause problems.
First, the docs contain a warning about using the multiprocessing module from the interactive interpreter:
https://docs.python.org/2/library/multiprocessing.html#using-a-pool-of-workers
Functionality within this package requires that the main module be importable by the children. This is covered in Programming guidelines however it is worth pointing out here. This means that some examples, such as the Pool examples will not work in the interactive interpreter.
Second, you might want to return a string from your async function and then display it from your main process. I am not quite sure child processes have access to standard output, which might be locked to the main process.
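Building on that second point, a sketch that submits every task before collecting any result, so the workers can run concurrently while the parent does all the printing. The names describe_args and run_all are hypothetical, and the 10-second timeout is an arbitrary choice.

```python
import multiprocessing
import os


def describe_args(arg1=1, arg2=2):
    """Worker: build the message instead of printing it in the child."""
    return "arg1 = {0}, arg2 = {1} from process id: {2}".format(
        arg1, arg2, os.getpid())


def run_all(arg_list, processes=3):
    """Submit all tasks first, then gather the messages in the parent."""
    with multiprocessing.Pool(processes=processes) as pool:
        pending = [pool.apply_async(describe_args, kwds=entry)
                   for entry in arg_list]
        return [result.get(timeout=10) for result in pending]


if __name__ == "__main__":
    arg_list = [{'arg1': 3, 'arg2': 4}, {'arg1': 5, 'arg2': 6},
                {'arg1': 7, 'arg2': 8}]
    for message in run_all(arg_list):
        print(message)
```

Calling get() right after each apply_async, by contrast, waits for one task to finish before the next is submitted.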
