Pathos multiprocessing in class produces garbled standard output - python

I'm trying to use multiprocessing in a class I have written to speed up calculations. I'm using pathos.multiprocessing and dill, and calling map on a ProcessingPool. I've tested the functionality of multiprocessing in a console and it performed as expected. The issue I'm having is that when I try to implement it in my code, as soon as it calls pool.map the terminal I'm using starts spitting out ridiculous nonsense. The output is recognizable as being from the code, but I have no idea how it's making it print. Some of it comes from a method like the one I defined below, which includes the current datetime. In the nonsense I see that it's printing the current time, after pool.map was called, so this isn't just old output being repeatedly printed; it's new output. Here is a little code illustrating how I'm using multiprocessing.
My_func is a little more complicated than I have below, but as a first step I changed it to literally what is written below, and the problem still persists.
Additionally, Ctrl-C does trigger a KeyboardInterrupt, but does not completely stop the program. I'm using Visual Studio and Python 2.7.13 on Windows 10.
from pathos.multiprocessing import ProcessingPool
import dill
import datetime

class my_class(object):
    def __init__(self):
        pool = ProcessingPool(nodes=4)
        p1 = [1,2,3]
        p2 = [4,5,6]
        p3 = [7,8,9]
        results = pool.map(self.my_func, p1, p2, p3)

    def my_func(self, x, y, z):
        print(x, y, z)

    def status_printout(self, message):
        header = datetime.datetime.now().strftime('%Y/%m/%d %H:%M:%S')
        print(header + ' -- ' + message)

Try using a Lock to ensure only one of the subprocesses writes to stdout at a time.
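For illustration only, a minimal sketch of that suggestion (not the fix the asker ultimately used): share a multiprocessing.Lock with the workers through the pool initializer so only one process writes to stdout at a time. The names init_lock, print_lock, and work are hypothetical.

import multiprocessing

def init_lock(lock):
    # runs once in each worker; stores the shared lock as a module-level global
    global print_lock
    print_lock = lock

def work(x):
    with print_lock:  # only one worker prints at a time
        print(x)
    return x

if __name__ == '__main__':
    lock = multiprocessing.Lock()
    pool = multiprocessing.Pool(4, initializer=init_lock, initargs=(lock,))
    pool.map(work, range(10))
    pool.close()
    pool.join()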

I was not using the suggested
if __name__ == '__main__':
    freeze_support()
for Windows. Things are behaving normally now.
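For reference, a hedged sketch of how that guarded entry point can look with the question's class. freeze_support comes from the standard multiprocessing module; instantiating my_class under the guard is an illustrative choice, not code from the original post.

from multiprocessing import freeze_support
from pathos.multiprocessing import ProcessingPool

class my_class(object):
    def __init__(self):
        pool = ProcessingPool(nodes=4)
        self.results = pool.map(self.my_func, [1,2,3], [4,5,6], [7,8,9])

    def my_func(self, x, y, z):
        print(x, y, z)

if __name__ == '__main__':
    freeze_support()  # needed on Windows, harmless elsewhere
    my_class()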

Related

Multiprocessing only using a single thread instead of multiple

This question has been asked and solved a few times recently but I have quite a specific example...
I have a multiprocessing function that was working absolutely fine in complete isolation yesterday (in an interactive notebook). However, I decided to parameterise it so I can call it as part of a larger pipeline and for abstraction / a cleaner notebook, and now it's only using a single thread instead of 6.
import pandas as pd
import multiprocessing as mp
from multiprocessing import get_context

mp.set_start_method('forkserver')

def multiprocess_function(func, iterator, input_data):
    result_list = []

    def append_result(result):
        result_list.append(result)

    with get_context('fork').Pool(processes=6) as pool:
        for i in iterator:
            pool.apply_async(func, args=(i, input_data), callback=append_result)
        pool.close()
        pool.join()

    return result_list

multiprocess_function(count_live, run_weeks, base_df)
My previous version of the code executed differently: instead of a return / call, I was using the following at the bottom of the function (which doesn't work at all now that I've parameterised it, even with the args assigned).
if __name__ == '__main__':
    multiprocess_function()
The function executes fine, just only operates across one thread as per the output in top.
Apologies if this is something incredibly simple - I'm not a programmer, I'm an analyst :)
edit: everything works absolutely fine if I include the if __name__ == '__main__': etc. at the bottom of the function and execute the cell; however, when I do this I have to remove the parameters - maybe just something to do with scoping. If I execute by calling the function, whether it is parameterised or not, it only operates on a single thread.
You've got two problems:
You're not using an import guard.
You're not setting the default start method inside the import guard.
Between the two of them, you end up telling Python to spawn the forkserver inside the forkserver, which can only cause you grief. Change the structure of your code to:
import pandas as pd
import multiprocessing as mp
from multiprocessing import get_context

def multiprocess_function(func, iterator, input_data):
    result_list = []
    with get_context('fork').Pool(processes=6) as pool:
        for i in iterator:
            pool.apply_async(func, args=(i, input_data), callback=result_list.append)
        pool.close()
        pool.join()
    return result_list

if __name__ == '__main__':
    mp.set_start_method('forkserver')
    multiprocess_function(count_live, run_weeks, base_df)
Since you didn't show where you got count_live, run_weeks and base_df from, I'll just say that for the code as written, they should be defined in the guarded section (since nothing relies on them as a global).
There are other improvements to be made (apply_async is being used in a way that makes me think you really just wanted to listify the result of pool.imap_unordered, without the explicit loop), but that's fixing the big issues that will wreck use of the spawn or forkserver start methods.
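For illustration, a minimal sketch of that simplification. The answer mentions imap_unordered; starmap is used here instead so func can still be called positionally as func(i, input_data), and the stand-in _scale function is purely hypothetical (the original count_live, run_weeks, and base_df are not shown in the post).

from multiprocessing import get_context

def multiprocess_function(func, iterator, input_data, processes=6):
    # the pool collects results directly; no callback or manual result_list needed
    # ('fork' as in the original post; not available on Windows)
    with get_context('fork').Pool(processes=processes) as pool:
        return pool.starmap(func, ((i, input_data) for i in iterator))

def _scale(i, input_data):
    # hypothetical stand-in for count_live
    return i * input_data

if __name__ == '__main__':
    print(multiprocess_function(_scale, range(10), 3))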
using "get_context('spawn') " instead of "get_context('fork')" maybe will solve your problem

Post-processing results after multi-processing in Python

So I have a simple MP code and it works like a charm. However, when I do a very simple post processing on the data generated via MP, the code does not work anymore. It never stops and runs like forever! This is the code (and again it works perfectly):
import numpy as np
from multiprocessing import Pool

n = 4
nMCS = 10**5

def my_function(j):
    result = []
    for j in range(nMCS // n):
        a = np.random.rand(10, 2)
        result.append(a)
    return result

if __name__ == '__main__':
    __spec__ = "ModuleSpec(name='builtins', loader=<class '_frozen_importlib.BuiltinImporter'>)"  # this is because I am using Spyder!
    pool = Pool(processes=n)
    data = pool.map(my_function, [i for i in range(n)])
    pool.close()
    pool.join()

#final_result = np.concatenate(data) ### this is what ruins my code! ###
Meanwhile, if I add final_result = np.concatenate(data) at the end, it never works! I am using Spyder and if I simply type final_result = np.concatenate(data) in the console AFTER MP is done, it gives me what I want i.e. a concatenated list. However, if I put that simple line in the main program at the very end, it just doesn't work. Could anyone tell me how to fix this?
P.S. this is a very simple example I generated so you can understand what is going on; my real problem is way more complicated and there is no way I can do post processing after I am done with MP.
As @Ares already implied, you fix the problem by indenting everything south of the if __name__ == "__main__" statement into the if-block.
FYI, this happens on Windows, which doesn't provide forking for starting up new processes like Unix-y systems do, but uses 'spawn' as its default (and only) start method. Spawn means the OS has to boot a new process with a fresh interpreter for every worker process.
Your worker-processes will need to import your target function my_function. When this happens, everything not protected within the if __name__ == "__main__":-block will also run in every child-process on import.
Your problem is that when you run np.concatenate, it's not inside the if __name__ == '__main__': block. I suspect that the problem you're encountering is Spyder specific, but updating the indentation should fix it.
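A minimal sketch of what both answers describe, with the post-processing moved inside the guard so it runs only in the parent process (the Spyder-specific __spec__ line is omitted here):

import numpy as np
from multiprocessing import Pool

n = 4
nMCS = 10**5

def my_function(j):
    return [np.random.rand(10, 2) for _ in range(nMCS // n)]

if __name__ == '__main__':
    pool = Pool(processes=n)
    data = pool.map(my_function, range(n))
    pool.close()
    pool.join()
    final_result = np.concatenate(data)  # now runs only in the parent process
    print(final_result.shape)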

multiprocessing pool example does not work and freeze the kernel

I'm trying to parallelize a script, but for an unknown reason the kernel just freezes without throwing any errors.
minimal working example:
from multiprocessing import Pool

def f(x):
    return x*x

p = Pool(6)
print(p.map(f, range(10)))
Interestingly, all works fine if I define my function in another file then import it. How can I make it work without the need of another file?
I work with Spyder (Anaconda), and I get the same result if I run my code from the Windows command line.
This happens because you didn't protect your "procedural" part of the code from re-execution when your child processes are importing f.
They need to import f because Windows doesn't support forking as a start method for new processes (only spawn). A new Python process has to be started from scratch, f has to be imported, and that import will also trigger another Pool to be created in every child process ... and in their children, and their children's children...
To prevent this recursion, you have to insert an if __name__ == '__main__': line between the upper part, which should run on imports, and the lower part, which should only run when your script is executed as the main script (which is only the case for the parent).
from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':  # protect your program's entry point
    p = Pool(6)
    print(p.map(f, range(10)))
Separating your code like that is mandatory for multiprocessing on Windows and Unix-y systems when used with 'spawn' or 'forkserver' start-method instead of default 'fork'. In general, start-methods can be modified with multiprocessing.set_start_method(method).
Since Python 3.8, macOS also uses 'spawn' instead of 'fork' as default.
It's generally good practice to separate any script into an upper "definition" part and a lower "execution as main" part, to make the code importable without unnecessarily executing parts that are only relevant when it runs as the top-level script. Last but not least, it facilitates understanding the control flow of your program when you don't intermix definitions and executions.
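A minimal sketch of changing the start method mentioned above; the method is set once, inside the entry-point guard, before the pool is created:

import multiprocessing

def f(x):
    return x*x

if __name__ == '__main__':
    multiprocessing.set_start_method('spawn')  # or 'fork' / 'forkserver' where available
    with multiprocessing.Pool(6) as p:
        print(p.map(f, range(10)))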

multiprocessing.Pool in jupyter notebook works on linux but not windows

I'm trying to run a few independent computations (though reading from the same data). My code works when I run it on Ubuntu, but not on Windows (windows server 2012 R2), where I get the error:
'module' object has no attribute ...
when I try to use multiprocessing.Pool (it appears in the kernel console, not as output in the notebook itself)
(And I've already made the mistake of defining the function AFTER creating the pool, but I've since corrected it; that's not the problem.)
This happens even on the simplest of examples:
from multiprocessing import Pool

def f(x):
    return x**2

pool = Pool(4)
for res in pool.map(f, range(20)):
    print res
I know that it needs to be able to import the module (and I have no idea how this works when working in the notebook), and I've heard of IPython.Parallel, but I have been unable to find any documentation or examples.
Any solutions/alternatives would be most welcome.
I would post this as a comment since I don't have a full answer, but I'll amend it as I figure out what is going on.
from multiprocessing import Pool

def f(x):
    return x**2

if __name__ == '__main__':
    pool = Pool(4)
    for res in pool.map(f, range(20)):
        print(res)
This works. I believe the answer to this question is here. In short, the subprocesses do not know they are subprocesses and are attempting to run the main script recursively.
This is the error I am given, which gives us the same solution:
RuntimeError:
    An attempt has been made to start a new process before the
    current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.
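A common workaround for notebooks on Windows, offered here only as a hedged sketch: put the worker function in a small importable module so the spawned child processes can find it by import (the asker already notes that the main module must be importable by the children). The file name workers.py is hypothetical.

# workers.py (hypothetical helper module saved next to the notebook)
def f(x):
    return x**2

# notebook cell
from multiprocessing import Pool
from workers import f

if __name__ == '__main__':
    pool = Pool(4)
    for res in pool.map(f, range(20)):
        print(res)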

Python multiprocessing hanging on pool.join()

I'm having problems with Python freezing when I try to use the multiprocessing module. I'm using Spyder 2.3.2 with Python 3.4.3 (I have previously encountered problems that were specific to IPython).
I've reduced it to the following MWE:
import multiprocessing

def test_function(arg1=1, arg2=2):
    print("arg1 = {0}, arg2 = {1}".format(arg1, arg2))
    return None

pool = multiprocessing.Pool(processes=3)
for i in range(6):
    pool.apply_async(test_function)
pool.close()
pool.join()
This, in its current form, should just produce six identical iterations of test_function. However, while I can enter the commands with no hassle, when I give the command pool.join(), IPython hangs and I have to restart the kernel.
The function works perfectly well when run in serial (the next step in my MWE would be to use pool.apply_async(test_function, kwds=entry)).
for i in range(6):
    test_function()

arg_list = [{'arg1': 3, 'arg2': 4}, {'arg1': 5, 'arg2': 6}, {'arg1': 7, 'arg2': 8}]
for entry in arg_list:
    test_function(**entry)
I have (occasionally, and I'm unable to reliably reproduce it) come across an error message of ZMQError: Address already in use, which led me to this bug report, but preceding my code with either multiprocessing.set_start_method('spawn') or multiprocessing.set_start_method('forkserver') doesn't seem to work.
Can anyone offer any help/advice? Thanks in advance if so.
@Anarkopsykotik is correct: you must use a main guard, and you can get it to print by returning a result to the main process.
Here's a working example.
import multiprocessing
import os

def test_function(arg1=1, arg2=2):
    string = "arg1 = {0}, arg2 = {1}".format(arg1, arg2) + " from process id: " + str(os.getpid())
    return string

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=3)
    for i in range(6):
        result = pool.apply_async(test_function)
        print(result.get(timeout=1))
    pool.close()
    pool.join()
Two things pop to my mind that might cause problems.
First, in the docs there is a warning about using the interactive interpreter with the multiprocessing module:
https://docs.python.org/2/library/multiprocessing.html#using-a-pool-of-workers
Functionality within this package requires that the main module be importable by the children. This is covered in Programming guidelines however it is worth pointing out here. This means that some examples, such as the Pool examples will not work in the interactive interpreter.
Second: you might want to retrieve a string from your async function, and then display it from your main process. I am not quite sure the child processes have access to standard output, which might be tied to the main process.
