I'm having problems with Python freezing when I try to use the multiprocessing module. I'm using Spyder 2.3.2 with Python 3.4.3 (I have previously encountered problems that were specific to iPython).
I've reduced it to the following MWE:
import multiprocessing

def test_function(arg1=1,arg2=2):
    print("arg1 = {0}, arg2 = {1}".format(arg1,arg2))
    return None

pool = multiprocessing.Pool(processes=3)
for i in range(6):
    pool.apply_async(test_function)

pool.close()
pool.join()
This, in its current form, should just produce six identical iterations of test_function. However, while I can enter the commands with no hassle, when I give the command pool.join(), iPython hangs, and I have to restart the kernel.
The function works perfectly well when run serially (the next step in my MWE would be to use pool.apply_async(test_function,kwds=entry)):
for i in range(6):
    test_function()

arg_list = [{'arg1':3,'arg2':4},{'arg1':5,'arg2':6},{'arg1':7,'arg2':8}]
for entry in arg_list:
    test_function(**entry)
I have (occasionally, and I'm unable to reliably reproduce it) come across an error message of ZMQError: Address already in use, which led me to this bug report, but preceding my code with either multiprocessing.set_start_method('spawn') or multiprocessing.set_start_method('forkserver') doesn't seem to work.
Can anyone offer any help/advice? Thanks in advance if so.
@Anarkopsykotik is correct: you must use a main guard (if __name__ == '__main__':), and you can get it to print by returning a result to the main process.
Here's a working example.
import multiprocessing
import os

def test_function(arg1=1,arg2=2):
    string="arg1 = {0}, arg2 = {1}".format(arg1,arg2) +" from process id: "+ str(os.getpid())
    return string

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=3)
    for i in range(6):
        result = pool.apply_async(test_function)
        print(result.get(timeout=1))
    pool.close()
    pool.join()
Two things pop to my mind that might cause problems.
First, in the docs, there is a warning about using the interactive interpreter with the multiprocessing module:
https://docs.python.org/2/library/multiprocessing.html#using-a-pool-of-workers
Functionality within this package requires that the main module be importable by the children. This is covered in Programming guidelines however it is worth pointing out here. This means that some examples, such as the Pool examples will not work in the interactive interpreter.
Second: you might want to return a string from your async function and then display it from the main process. I am not quite sure child processes have access to standard output, which might be tied to the main process.
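A minimal sketch of both points, assuming the code is saved to a file and run as a script rather than typed into an interactive session: the worker returns its message instead of printing it, and keyword arguments are passed through the kwds parameter of apply_async (the next step mentioned in the question). The helper name run_jobs is made up for this illustration.

```python
import multiprocessing
import os


def test_function(arg1=1, arg2=2):
    # Return the message instead of printing it, so the parent decides
    # where the output goes.
    return "arg1 = {0}, arg2 = {1} from process id: {2}".format(
        arg1, arg2, os.getpid())


def run_jobs(arg_list, processes=3):
    # run_jobs is a hypothetical helper, not part of multiprocessing.
    with multiprocessing.Pool(processes=processes) as pool:
        # Submit everything first so the jobs can run concurrently...
        pending = [pool.apply_async(test_function, kwds=entry)
                   for entry in arg_list]
        # ...then collect the results in submission order.
        return [job.get(timeout=30) for job in pending]


if __name__ == '__main__':
    arg_list = [{'arg1': 3, 'arg2': 4},
                {'arg1': 5, 'arg2': 6},
                {'arg1': 7, 'arg2': 8}]
    for line in run_jobs(arg_list):
        print(line)
```

Calling get() immediately after each apply_async() waits for that job before the next one is submitted, so the jobs effectively run one at a time. Collecting the pending results first and calling get() afterwards avoids that.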
I have a large program whose calculations take a while. I decided to learn about multithreading and multiprocessing because only 20% of my processor was being used during the calculation. Multithreading gave no improvement, so I tried multiprocessing, but whenever I use it I just get a lot of errors, even with very simple code.
This is the code I tested after I started having problems with my large, calculation-heavy program:
from concurrent.futures import ProcessPoolExecutor

def func():
    print("done")

def func_():
    print("done")

def main():
    executor = ProcessPoolExecutor(max_workers=3)
    p1 = executor.submit(func)
    p2 = executor.submit(func_)

main()
and the error message I am getting says
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
    if __name__ == '__main__':
        freeze_support()
        ...
This is not the whole message because it is very long, but I think this part may be helpful. Pretty much everything else in the error message is just like "error at line ... in ..."
If it helps, the full code is at: https://github.com/nobody48sheldor/fuseeinator2.0
It might not be the latest version.
I updated your code to show main being called. This is an issue with spawning operating systems like Windows. To test on my linux machine I had to add a bit of code. But this crashes on my machine:
# Test code to make Linux spawn like Windows and generate the error. This code
# is not needed on Windows.
if __name__ == "__main__":
    import multiprocessing as mp
    mp.freeze_support()
    mp.set_start_method('spawn')

# test script
from concurrent.futures import ProcessPoolExecutor

def func():
    print("done")

def func_():
    print("done")

def main():
    executor = ProcessPoolExecutor(max_workers=3)
    p1 = executor.submit(func)
    p2 = executor.submit(func_)

main()
In a spawning system, python can't just fork into a new execution context. Instead, it runs a new instance of the python interpreter, imports the module and pickles/unpickles enough state to make a child execution environment. This can be a very heavy operation.
But your script is not import safe. Since main() is called at module level, the import in the child would run main again. That would create a grandchild subprocess which runs main again (and etc until you hang your machine). Python detects this infinite loop and displays the message instead.
Top level scripts are always called "__main__". Put all of the code that should only be run once at the script level inside an if. If the module is imported, nothing harmful is run.
if __name__ == "__main__":
    main()
and the script will work.
There are code analyzers out there that import modules to extract doc strings, or other useful stuff. Your code shouldn't fire the missiles just because some tool did an import.
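Putting the pieces together, an import-safe version of the script might look like the sketch below. It assumes the workers should hand a value back to the parent. The return values and the use of the executor as a context manager are additions for illustration, not part of the original script.

```python
from concurrent.futures import ProcessPoolExecutor


def func():
    return "done"


def func_():
    return "done too"


def main():
    # The with-block waits for the submitted work and shuts the pool down.
    with ProcessPoolExecutor(max_workers=3) as executor:
        p1 = executor.submit(func)
        p2 = executor.submit(func_)
        # result() blocks until the worker finishes and re-raises any
        # exception that happened in the child.
        return [p1.result(), p2.result()]


if __name__ == "__main__":
    # Only runs when the file is executed as a script, not when a spawned
    # child (or any other tool) imports it.
    print(main())
```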
Another way to solve the problem is to move everything multiprocessing related out of the script and into a module. Suppose I had a module with your code in it
whatever.py
from concurrent.futures import ProcessPoolExecutor

def func():
    print("done")

def func_():
    print("done")

def main():
    executor = ProcessPoolExecutor(max_workers=3)
    p1 = executor.submit(func)
    p2 = executor.submit(func_)
myscript.py
#!/usr/bin/env python3
import whatever
whatever.main()
Now, since the pool is already in an imported module that doesn't do this crazy restart-itself thing, no if __name__ == "__main__": is necessary. It's a good idea to put it in myscript.py anyway, but not required.
I have a Mac (macOS 10.15.4, Python 3.8.2) and need to use multiprocessing, but on my machine it doesn't work.
For example, I copied this simple parallel Python program:
import multiprocessing as mp
import time

def test_function(i):
    print("function starts" + str(i))
    time.sleep(1)
    print("function ends" + str(i))

if __name__ == '__main__':
    pool = mp.Pool(mp.cpu_count())
    pool.map(test_function, [i for i in range(4)])
    pool.close()
    pool.join()
What I expect to see in the output:
function starts0
function ends0
function starts1
function ends1
function starts2
function ends2
function starts3
function ends3
Or similar...
What I actually see:
= RESTART: /Users/Simulazioni/prova.py
>>>
Just nothing: no errors and no information. I have already tried many approaches without results. The main problem, as far as I can see, is the function call. In fact, the instruction
if __name__ == '__main__':
doesn't call the function
def test_function(i):
I tried many examples of that kind without results.
Is it possible, and what is the easiest way, to parallelize on macOS?
I know this is a bit of an old question, but I just faced the same issue and solved it by using a different package in place of the standard multiprocessing module, namely:
import multiprocess
instead of:
import multiprocessing
Reference:
https://pypi.org/project/multiprocess/
There seems to be a litany of questions and answers on Stack Overflow about the multiprocessing library. I have looked through all the relevant ones I can find and have not found one that speaks directly to my problem.
I am trying to apply the same function to multiple files in parallel. Whenever I start the processing, though, the computer just spins up several instances of Python and then does nothing. No computations happen at all and the processes just sit idle.
I have looked at all of the similar questions on Stack Overflow, and none seem to have my problem of idle processes.
What am I doing wrong?
Define the function (abbreviated for the example; checked to make sure it works):
import pandas as pd
import numpy as np
import glob
import os
#from timeit import default_timer as timer
import talib
from multiprocessing import Process

def example_function(file):
    df=pd.read_csv(file, header = 1)
    stock_name = os.path.basename(file)[:-4]
    macd, macdsignal, macdhist = talib.MACD(df.Close, fastperiod=12, slowperiod=26, signalperiod=9)
    df['macd'] = macdhist*1000
    print(f'stock{stock_name} processed')
    final_macd_report.append(df)
getting a list of all the files in the directory i want to run the function on
import glob
path = r'C:\Users\josiahh\Desktop\big_test3/*'
files = [f for f in glob.glob(path, recursive=True)]
attempting multiprocessing
import multiprocessing as mp

if __name__ == '__main__':
    p = mp.Pool(processes = 5)
    async_result = p.map_async(example_function, files)
    p.close()
    p.join()
    print("Complete")
any help would be greatly appreciated.
There's nothing wrong with the structure of the code, so something is going wrong that can't be guessed from what you posted. Start with something very much simpler, then move it in stages to what you're actually trying to do. You're importing mountains of extension (3rd party) code, and the problem could be anywhere. Here's a start:
def example_function(arg):
    from time import sleep
    msg = "crunching " + str(arg)
    print(msg)
    sleep(arg)
    print("done " + msg)

if __name__ == '__main__':
    import multiprocessing as mp
    p = mp.Pool(processes = 5)
    async_result = p.map_async(example_function, reversed(range(15)))
    print("result", async_result.get())
    p.close()
    p.join()
    print("Complete")
That works fine on Win10 under 64-bit Python 3.7.4 for me. Does it for you?
Note especially the async_result.get() at the end. That displays a list with 15 None values. You never do anything with your async_result. Because of that, if any exception was raised in a worker process, it will most likely silently vanish. In such cases .get()'ing the result will (re)raise the exception in your main program.
Also please verify that your files list isn't in fact empty. We can't guess at that from here either ;-)
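To see why the .get() matters, here is a small sketch with a worker that fails on purpose. The names risky_square and run_batch are invented for this illustration.

```python
import multiprocessing as mp


def risky_square(x):
    # Fails on purpose for one input to stand in for a real bug.
    if x == 3:
        raise ValueError("cannot handle {}".format(x))
    return x * x


def run_batch(values, processes=2):
    with mp.Pool(processes=processes) as pool:
        async_result = pool.map_async(risky_square, values)
        # Without this call the ValueError raised in the worker would
        # never be seen by the parent process.
        return async_result.get(timeout=30)


if __name__ == '__main__':
    print(run_batch([0, 1, 2]))
    try:
        run_batch([1, 2, 3])
    except ValueError as exc:
        print("worker failed:", exc)
```

If the call to get() is dropped from run_batch, the second batch appears to finish normally even though one of its jobs raised.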
EDIT
I moved the async_result.get() into its own line, right after the map_async(), to maximize the chance of revealing an otherwise silent exception in the worker processes. At least add that much to your code too.
While I don't see anything wrong per se, I would like to suggest some changes.
In general, worker functions in a Pool are expected to return something. This return value is transferred back to the parent process. I like to use that as a status report. It is also a good idea to catch exceptions in the worker process, just in case.
For example:
def example_function(file):
    status = 'OK'
    try:
        df=pd.read_csv(file, header = 1)
        stock_name = os.path.basename(file)[:-4]
        macd, macdsignal, macdhist = talib.MACD(df.Close, fastperiod=12, slowperiod=26, signalperiod=9)
        df['macd'] = macdhist*1000
        final_macd_report.append(df)
    except Exception:  # a bare except would also swallow KeyboardInterrupt
        status = 'exception caught!'
    return {'filename': file, 'result': status}
(This is just a quick example. You might want to e.g. report the full exception traceback to help with debugging.)
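One way to report the full traceback, as a sketch: catch the exception in the worker and send traceback.format_exc() back as part of the status. The names process_file and safe_worker are hypothetical, and process_file is only a stand-in for the real per-file work.

```python
import traceback


def process_file(file):
    # Stand-in for the real work (read_csv, MACD, ...); fails for names
    # that do not end in .csv so there is something to report.
    if not file.endswith('.csv'):
        raise ValueError('not a csv file: ' + file)
    return len(file)


def safe_worker(file):
    # Never lets an exception escape; the parent gets a status dict instead.
    try:
        return {'filename': file, 'status': 'OK',
                'result': process_file(file), 'traceback': None}
    except Exception:
        return {'filename': file, 'status': 'exception caught!',
                'result': None, 'traceback': traceback.format_exc()}


if __name__ == '__main__':
    import multiprocessing as mp
    with mp.Pool(processes=2) as pool:
        for res in pool.map(safe_worker, ['a.csv', 'b.txt']):
            print(res['filename'], res['status'])
            if res['traceback']:
                print(res['traceback'])
```

The traceback travels back as a plain string, so it survives pickling regardless of what kind of exception the worker hit.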
If workers run for a long time, I like to get feedback ASAP.
So I prefer to use imap_unordered, especially if some tasks can take much longer than others. This returns an iterator that yields results in the order that jobs finish.
if __name__ == '__main__':
    with mp.Pool() as p:
        for res in p.imap_unordered(example_function, files):
            print(res)
This way you get unambiguous proof that a worker finished, and what the result was and if any problems occurred.
This is preferable over just calling print from the workers. With stdout buffering and multiple workers inheriting the same output stream there is no saying when you actually see something.
Edit: multiprocessing.Pool does not work well with interactive interpreters, especially on ms-windows. Basically, ms-windows lacks the fork system call that lets UNIX-like systems duplicate a process. So on ms-windows, multiprocessing has to try to mimic fork, which means importing the original program file in the child processes. That doesn't work well with interactive interpreters like IPython. One would probably have to dig deep into the internals of Jupyter and multiprocessing to find out the exact cause of the problem.
It seems that a workaround for this problem is to define the worker function in a separate module and import that in your code in IPython.
It is actually mentioned in the documentation that multiprocessing.Pool doesn't work well with interactive interpreters. See the note at the end of this section.
I am trying to use the Python multiprocessing library in order to parallelize a task I am working on:
import multiprocessing as MP

def myFunction((x,y,z)):
    ...create a sqlite3 database specific to x,y,z
    ...write to the database (one DB per process)

y = 'somestring'
z = <large read-only global dictionary to be shared>

jobs = []
for x in X:
    jobs.append((x,y,z,))

pool = MP.Pool(processes=16)
pool.map(myFunction,jobs)
pool.close()
pool.join()
Sixteen processes are started as seen in htop, however no errors are returned, no files written, no CPU is used.
Could it happen that there is an error in myFunction that is not reported to STDOUT and blocks execution?
Perhaps it is relevant that the python script is called from a bash script running in background.
The lesson learned here was to follow the strategy suggested in one of the comments and use multiprocessing.dummy until everything works.
At least in my case, errors were not visible otherwise and the processes were still running as if nothing had happened.
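As a sketch of that strategy: multiprocessing.dummy offers a Pool with the same interface backed by threads, so an exception in the worker comes back with an ordinary traceback in the same process. The worker below is a hypothetical stand-in, not the original myFunction.

```python
from multiprocessing.dummy import Pool as ThreadPool  # thread-backed Pool


def my_function(job):
    # Hypothetical stand-in for the real worker; unpack inside the body.
    x, y, z = job
    if x not in z:
        raise KeyError("no entry for {!r}".format(x))
    return "{}:{}:{}".format(x, y, z[x])


def run(jobs, workers=4):
    with ThreadPool(workers) as pool:
        # map() re-raises the first worker exception in the caller.
        return pool.map(my_function, jobs)


if __name__ == '__main__':
    shared = {1: 'a', 2: 'b'}
    print(run([(1, 'somestring', shared), (2, 'somestring', shared)]))
    try:
        run([(3, 'somestring', shared)])
    except KeyError as exc:
        print("bug found while debugging with threads:", exc)
```

Once everything works with threads, switching the import back to multiprocessing.Pool should need few other changes, provided the worker and its arguments can be pickled.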
I'm trying to run a few independent computations (though reading from the same data). My code works when I run it on Ubuntu, but not on Windows (windows server 2012 R2), where I get the error:
'module' object has no attribute ...
when I try to use multiprocessing.Pool (it appears in the kernel console, not as output in the notebook itself)
(And I've already made the mistake of defining the function AFTER creating the pool, and I've also corrected it, that's not the problem).
This happens even on the simplest of examples:
from multiprocessing import Pool

def f(x):
    return x**2

pool = Pool(4)
for res in pool.map(f,range(20)):
    print res
I know that it needs to be able to import the module (and I have no idea how this works when working in the notebook), and I've heard of IPython.Parallel, but I have been unable to find any documentation or examples.
Any solutions/alternatives would be most welcome.
I would post this as a comment since I don't have a full answer, but I'll amend as I figure out what is going on.
from multiprocessing import Pool

def f(x):
    return x**2

if __name__ == '__main__':
    pool = Pool(4)
    for res in pool.map(f,range(20)):
        print(res)
This works. I believe the answer to this question is here. In short, the subprocesses do not know they are subprocesses and are attempting to run the main script recursively.
This is the error I am given, which gives us the same solution:
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
    if __name__ == '__main__':
        freeze_support()
        ...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.