Python multiprocessing pool function not defined

I need to implement a multiprocessing pool that utilizes arbitrary packages for calculations. For this, I'm using Python and joblib 0.9.0. This code is basically the structure I want.
import numpy as np
from joblib import pool
def someComputation(x):
    return np.interp(x, [-1, 1], [-1, 1])
if __name__ == '__main__':
    some_set_of_numbers = [-1,-0.5,0,0.5,1]
    the_pool = pool.Pool(processes=2)
    solutions = [the_pool.apply_async(someComputation, (x,)) for x in some_set_of_numbers]
    print(solutions[0].get())
On both Windows 10 and Red Hat Enterprise Linux, running Anaconda 4.3.1 with Python 3.6.0 (as well as 3.5 and 3.4 in virtual environments), the name 'np' is not available inside someComputation() when it runs in a worker, which raises the error
File "C:\Anaconda3\lib\site-packages\multiprocessing_on_dill\pool.py", line 608, in get
raise self._value
NameError: name 'np' is not defined
However, on my Mac running OS X 10.11.6 with Python 3.5 and the same joblib, I get the expected output of '-1' with exactly the same code. This question is essentially the same, but it dealt with pathos and not joblib. The general answer was to put the numpy import statement inside the function:
from joblib import pool
def someComputation(x):
    import numpy as np
    return np.interp(x, [-1, 1], [-1, 1])
if __name__ == '__main__':
    some_set_of_numbers = [-1,-0.5,0,0.5,1]
    the_pool = pool.Pool(processes=2)
    solutions = [the_pool.apply_async(someComputation, (x,)) for x in some_set_of_numbers]
    print(solutions[0].get())
This solves the issue on the Windows and Linux machines, where they now output '-1' as expected but this solution seems clunky. Is there any reason why the first bit of code would work on a Mac, but not Windows or Linux? I ultimately need to run this code on the Linux machine so is there any fix that doesn't include putting the import statement inside of the function?
Edit:
After investigating a bit further, I found an old workaround I put in years ago that appears to be causing the issue. In joblib/pool.py, I changed line 44 from
from multiprocessing.pool import Pool
to
from multiprocessing_on_dill.pool import Pool
to support pickling of arbitrary functions. For some reason, this change is what really causes the issue on Windows and Linux, but the Mac machine runs just fine. Using multiprocessing instead of multiprocessing_on_dill solves the above issue, but the code doesn't work for the majority of my cases since they can't be pickled.

I am not sure what the exact issue is, but it appears that there is some problem with transferring the global scope over to the subprocesses that run the task. You can potentially avoid name errors by binding the name np as a function parameter:
def someComputation(x, np=np):
    return np.interp(x, [-1, 1], [-1, 1])
This has the advantage of not requiring a call to the import machinery every time the function is run. The name np will be bound to the function when it is first evaluated during module loading.
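The default-argument binding described above can be seen in isolation with a small sketch. It uses the standard-library math module as a stand-in for numpy (an assumption made only so the example has no third-party dependency), and it deletes the global name to imitate a worker that never received it:

```python
import math

def some_computation(x, math=math):
    # "math" here is the parameter, whose default was bound to the module
    # object when the def statement ran, not the module-level global.
    return math.sqrt(x)

# The module object is stored with the function itself.
assert some_computation.__defaults__[0] is math

# Imitate a worker whose global namespace lacks the name.
del math

print(some_computation(4.0))
```

The call still succeeds after the global is gone, because the lookup of math inside the function resolves to the parameter. Whether this survives a particular pickling library depends on how that library serializes function defaults, so it is worth checking on the target platform.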

Related

Python3 not handling matplotlib plot when using a multiprocess pool

I have a small script that creates several plots. Since no data is shared between them, the work can be split across processes. With Python 2.7 there is no problem. With Python 3.6, I can't seem to make it work.
I am using a pool (https://docs.python.org/3/library/multiprocessing.html and https://docs.python.org/2/library/multiprocessing.html) since I do not share objects or anything.
With Python 3, I get a crash without a traceback at the line fig = plt.figure(number).
I am running on macOS Sierra. I believe the problem is the same as in this topic (Saving multiple matplotlib figures with multiprocessing). Unfortunately, the problem wasn't really addressed there, as it was not the main issue.
One fast answer would be to use Python 2.7, but other pieces of my work rely on Python 3+ features.
Any idea how to at least get a traceback (verbose mode didn't show anything related to the crash), and then how to solve this issue?
Many thanks
Here is the smallest code that reproduces the error, taken from the thread mentioned above (it creates 4 files in the script's folder).
import matplotlib.pyplot as plt
import numpy.random as random
from multiprocessing import Pool
def do_plot(number):
    fig = plt.figure(number)
    a = random.sample(100)
    b = random.sample(100)
    plt.scatter(a, b)
    plt.savefig("%03d.jpg" % (number,))
    plt.close()
    print("Done ", number)
if __name__ == '__main__':
    pool = Pool()
    pool.map(do_plot, range(4))

python Spyder not importing numpy

I am writing a script using python Spyder 2.2.5 with Windows 7, python 2.7
At the very beginning I have tried all the import ways:
from numpy import *
or
import numpy
and also
import numpy as np
And for each and every line where I use numpy, I get an error when compiling:
QR10 = numpy.array(QR10,dtype=float)
QR20 = numpy.array(QR20,dtype=float)
QR11 = numpy.array(QR11,dtype=float)
QR21 = numpy.array(QR21,dtype=float)
However, even with these 30 errors, the script works if I run it.
Any help about this?

Python cannot actually be compiled. Spyder performs just a static code analysis using Pylint. Depending on the version of Pylint that is being used, it could be a bug or an undetectable case for it.
For example, the import statement (or the path that gets to it) could be within a conditional block, which cannot be resolved until runtime. Given that you are using Spyder, it could also be that you put your import statement directly on the console, or in a separate file, and then use the imported module from the script.
You may try to see if you receive the same error with a script like the following:
import numpy
QR10 = [1, 2, 3]
QR20 = [1, 2, 3]
QR11 = [1, 2, 3]
QR21 = [1, 2, 3]
QR10 = numpy.array(QR10,dtype=float)
QR20 = numpy.array(QR20,dtype=float)
QR11 = numpy.array(QR11,dtype=float)
QR21 = numpy.array(QR21,dtype=float)
You should not see E0602 here. Funnily enough, however, you may receive [E1101] Module 'numpy' has no 'array' member, because numpy defines some of its members dynamically, so Pylint cannot know about them (as you may see here). This was a bug that has actually been solved already.
The moral of the story is that Pylint errors shouldn't keep you awake at night. It's good to see the report, but if you are sure that your code makes sense and it runs just right, you may just ignore them - although trying to know why it is giving an error is always a good exercise.

import numpy as np
then use
QR10 = np.array(QR10,dtype=float) # instead of numpy.array

Importing scipy breaks multiprocessing support in Python

I am running into a bizarre problem that I can't explain. I'm hoping someone out there can help please!
I'm running Python 2.7.3 and Scipy v0.14.0 and am trying to implement some very simple multiprocessor algorithms to speed up my code using the multiprocessing module. I've managed to make a basic example work:
import multiprocessing
import numpy as np
import time
# import scipy.special
def compute_something(t):
    a = 0.
    for i in range(100000):
        a = np.sqrt(t)
    return a
if __name__ == '__main__':
    pool_size = multiprocessing.cpu_count()
    print "Pool size:", pool_size
    pool = multiprocessing.Pool(processes=pool_size)
    inputs = range(10)
    tic = time.time()
    builtin_outputs = map(compute_something, inputs)
    print 'Built-in:', time.time() - tic
    tic = time.time()
    pool_outputs = pool.map(compute_something, inputs)
    print 'Pool :', time.time() - tic
This runs fine, returning
Pool size: 8
Built-in: 1.56904006004
Pool : 0.447728157043
But if I uncomment the line import scipy.special, I get:
Pool size: 8
Built-in: 1.58968091011
Pool : 1.59387993813
and I can see that only one core is doing the work on my system. In fact, importing any module from the scipy package seems to have this effect (I've tried several).
Any ideas? I've never seen a case like this before, where an apparently innocuous import can have such a strange and unexpected effect.
Thanks!
Update (1)
Moving the scipy import line to the function compute_something partially improves the problem:
Pool size: 8
Built-in: 1.66807389259
Pool : 0.596321105957
Update (2)
Thanks to @larsmans for testing on a different system. Problem was not confirmed using Scipy v.0.12.0. Moving this query to the scipy mailing list and will post any answers.

After much digging around and posting an issue on the Scipy GitHub site, I've found a solution.
Before I start, this is documented very well here - I'll just give an overview.
This problem is not related to the version of Scipy, or Numpy that I was using. It originates in the system BLAS libraries that Numpy and Scipy use for various linear algebra routines. You can tell which libraries Numpy is linked to by running
python -c 'import numpy; numpy.show_config()'
If you are using OpenBLAS in Linux, you may find that the CPU affinity is set to 1, meaning that once these algorithms are imported in Python (via Numpy/Scipy), you can access at most one core of the CPU. To test this, within a Python terminal run
import os
os.system('taskset -p %s' %os.getpid())
If the CPU affinity is returned as f or ff, you can access multiple cores. In my case it would start like that, but upon importing numpy or scipy.any_module, it would switch to 1, hence my problem.
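The same check can be sketched without shelling out to taskset. This is a minimal sketch, assuming Python 3.3+ on Linux, where os.sched_getaffinity is available; the helper name allowed_cpus is made up for illustration, and the numpy/scipy import is left as a comment so the snippet has no third-party dependency:

```python
import os

def allowed_cpus():
    """Return the set of CPU indices this process may run on.

    Returns None on platforms that do not expose sched_getaffinity
    (for example macOS and Windows).
    """
    if hasattr(os, "sched_getaffinity"):
        return os.sched_getaffinity(0)  # 0 means "the calling process"
    return None

before = allowed_cpus()
# import numpy or scipy here, then compare the two sets
after = allowed_cpus()
print("before:", before)
print("after: ", after)
```

If the second set shrinks to a single CPU after the import, the process is affected by the affinity problem described above.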
I've found two solutions:
Change CPU affinity
You can manually set the CPU affinity of the master process at the top of the main function so that the code looks like this:
import multiprocessing
import numpy as np
import math
import time
import os
def compute_something(t):
    a = 0.
    for i in range(10000000):
        a = math.sqrt(t)
    return a
if __name__ == '__main__':
    pool_size = multiprocessing.cpu_count()
    os.system('taskset -cp 0-%d %s' % (pool_size, os.getpid()))
    print "Pool size:", pool_size
    pool = multiprocessing.Pool(processes=pool_size)
    inputs = range(10)
    tic = time.time()
    builtin_outputs = map(compute_something, inputs)
    print 'Built-in:', time.time() - tic
    tic = time.time()
    pool_outputs = pool.map(compute_something, inputs)
    print 'Pool :', time.time() - tic
Note that selecting a value higher than the number of cores for taskset doesn't seem to matter - it just uses the maximum possible number.
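On Python 3.3+ under Linux, the affinity can probably also be widened from within Python instead of calling taskset. The following is only a sketch under that assumption; reset_cpu_affinity is a hypothetical helper name, and it returns None where the scheduler calls are missing or the request is refused:

```python
import os

def reset_cpu_affinity():
    """Allow the current process to run on every CPU the OS reports.

    Returns the resulting set of CPU indices, or None when the platform
    does not expose the affinity calls or rejects the request.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    try:
        # pid 0 means "the calling process"
        os.sched_setaffinity(0, range(os.cpu_count() or 1))
        return os.sched_getaffinity(0)
    except OSError:
        return None

print(reset_cpu_affinity())
```

As with the taskset version, this would be called after the numpy/scipy imports and before the pool is created, so that the workers inherit the widened mask.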
Switch BLAS libraries
Solution documented at the site linked above. Basically: install libatlas and run update-alternatives to point numpy to ATLAS rather than OpenBLAS.

Python Mlab - cannot import name find_available_releases

I am new to Python. I am trying to run MATLAB from inside Python using the mlab package. I was following the guide on the website, and I entered this in the Python command line:
from mlab.releases import latest_release
The error I got was:
cannot import name find_available_releases
It seems that under matlabcom.py there was no find_available_releases function.
May I know if anyone knows how to resolve this? Thank you!
PS: I am using Windows 7, MATLAB 2012a and Python 2.7

I skimmed through the code, and I don't think all of the README file and its documentation match what's actually implemented. It appears to be mostly copied from the original mlabwrap project.
This is confusing because mlabwrap is implemented using a C extension module to interact with the MATLAB Engine API. However the mlab code seems to have replaced that part with a pure Python implementation as the MATLAB-bridge backend. It comes from "Dana Pe'er Lab" and it uses two different methods to interact with MATLAB depending on the platform (COM/ActiveX on Windows and pipes on Linux/Mac).
Now that we understand how the backend is implemented, you can start looking at the import error.
Note: the Linux/Mac part of the code tries to find the MATLAB executable in some hardcoded locations, and lets you choose between different versions.
However you are working on Windows, and the code doesn't really implement any way of picking between MATLAB releases for this platform (so all of the methods like discover_location and find_available_releases are useless on Windows). In the end, the COM object is created as:
self.client = win32com.client.Dispatch('matlab.application')
As explained here, the ProgID matlab.application is not version-specific, and will simply use whatever was registered as the default MATLAB COM server. We can explicitly specify what MATLAB version we want (assuming you have multiple installations), for instance matlab.application.8.3 will pick MATLAB R2014a.
So to fix the code, IMO the easiest way would be to get rid of all that logic about multiple MATLAB versions (in the Windows part of the code), and just let it create the MATLAB COM object as is. I haven't attempted it, but I don't think it's too involved... Good luck!
EDIT:
I downloaded the module and managed to get it to work on Windows (I'm using Python 2.7.6 and MATLAB R2014a). Here are the changes:
$ git diff
diff --git a/src/mlab/matlabcom.py b/src/mlab/matlabcom.py
index 93f075c..da1c6fa 100644
--- a/src/mlab/matlabcom.py
+++ b/src/mlab/matlabcom.py
@@ -21,6 +21,11 @@ except:
     print 'win32com in missing, please install it'
     raise
+def find_available_releases():
+    # report we have all versions
+    return [('R%d%s' % (y,v), '')
+        for y in range(2006,2015) for v in ('a','b')]
+
 def discover_location(matlab_release):
     pass
@@ -62,7 +67,7 @@ class MatlabCom(object):
         """
         self._check_open()
         try:
-            self.eval('quit();')
+            pass #self.eval('quit();')
         except:
             pass
         del self.client
diff --git a/src/mlab/mlabraw.py b/src/mlab/mlabraw.py
index 3471362..16e0e2b 100644
--- a/src/mlab/mlabraw.py
+++ b/src/mlab/mlabraw.py
@@ -42,6 +42,7 @@ def open():
     if is_win:
         ret = MatlabConnection()
         ret.open()
+        return ret
     else:
         if settings.MATLAB_PATH != 'guess':
             matlab_path = settings.MATLAB_PATH + '/bin/matlab'
diff --git a/src/mlab/releases.py b/src/mlab/releases.py
index d792b12..9d6cf5d 100644
--- a/src/mlab/releases.py
+++ b/src/mlab/releases.py
@@ -88,7 +88,7 @@ class MatlabVersions(dict):
         # Make it a module
         sys.modules['mlab.releases.' + matlab_release] = instance
         sys.modules['matlab'] = instance
-        return MlabWrap()
+        return instance
     def pick_latest_release(self):
         return get_latest_release(self._available_releases)
First I added the missing find_available_releases function. I made it so that it reports that all MATLAB versions are available (like I explained above, it doesn't really matter because of the way the COM object is created). An even better fix would be to detect the installed/registered MATLAB versions using the Windows registry (check the keys HKCR\Matlab.Application.X.Y and follow their CLSID in HKCR\CLSID). That way you can truly choose and pick which version to run.
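To make the effect of the added function concrete, here is the same list comprehension as a standalone sketch. It is written for Python 3 (the patch itself targets Python 2.7), and the variable names year and half are mine:

```python
def find_available_releases():
    # Report every release name from R2006a through R2014b as "available".
    # The second tuple element (the install location) is left empty, as in
    # the patch, because the COM backend never uses it.
    return [('R%d%s' % (year, half), '')
            for year in range(2006, 2015)
            for half in ('a', 'b')]

releases = find_available_releases()
print(releases[0], releases[-1], len(releases))
# prints: ('R2006a', '') ('R2014b', '') 18
```

Nine years with two releases each give 18 entries, which is why any name from R2006a to R2014b can be imported from mlab.releases after the patch, regardless of what is actually installed.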
I also fixed two unrelated bugs (one where the author forgot the function return value, and the other unnecessarily creating the wrapper object twice).
Note: During testing, it might be faster NOT to start and shut down a MATLAB instance each time the script is called. This is why I commented out self.eval('quit();') in the close function. That way you can start MATLAB using matlab.exe -automation (do this only once), and then repeatedly re-use the session without shutting it down. Just kill the process when you're done :)
Here is a Python example to test the module (I also show a comparison against NumPy/SciPy/Matplotlib):
test_mlab.py
# could be anything from: latest_release, R2014b, ..., R2006a
# makes no difference :)
from mlab.releases import R2014a as matlab
# show MATLAB version
print "MATLAB version: ", matlab.version()
print matlab.matlabroot()
# compute SVD of a NumPy array
import numpy as np
A = np.random.rand(5, 5)
U, S, V = matlab.svd(A, nout=3)
print "S = \n", matlab.diag(S)
# compare MATLAB's SVD against Scipy's SVD
U, S, V = np.linalg.svd(A)
print S
# 3d plot in MATLAB
X, Y, Z = matlab.peaks(nout=3)
matlab.figure(1)
matlab.surf(X, Y, Z)
matlab.title('Peaks')
matlab.xlabel('X')
matlab.ylabel('Y')
matlab.zlabel('Z')
# compare against matplotlib surface plot
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.gca(projection='3d')
ax.plot_surface(X, Y, Z, rstride=1, cstride=1, cmap='jet')
ax.view_init(30.0, 232.5)
plt.title('Peaks')
plt.xlabel('X')
plt.ylabel('Y')
ax.set_zlabel('Z')
plt.show()
Here is the output I get:
C:\>python test_mlab.py
MATLAB version: 8.3.0.532 (R2014a)
C:\Program Files\MATLAB\R2014a
S =
[[ 2.41632007]
[ 0.78527851]
[ 0.44582117]
[ 0.29086795]
[ 0.00552422]]
[ 2.41632007 0.78527851 0.44582117 0.29086795 0.00552422]
EDIT2:
The above changes have been accepted and merged into mlab.

You are right that find_available_releases() has not been written. There are two ways to work around this:
1. Check out the code on Linux and work on it there (you are working on Windows!).
2. Change the code as shown below.
Add the following function to matlabcom.py, as in matlabpipe.py:
def find_available_releases():
    global _RELEASES
    if not _RELEASES:
        _RELEASES = list(_list_releases())
    return _RELEASES
If you look at the mlabraw.py file, the following code makes it clear why I am saying this:
import sys
is_win = 'win' in sys.platform
if is_win:
    from matlabcom import MatlabCom as MatlabConnection
    from matlabcom import MatlabError as error
    from matlabcom import discover_location, find_available_releases
    from matlabcom import WindowsMatlabReleaseNotFound as MatlabReleaseNotFound
else:
    from matlabpipe import MatlabPipe as MatlabConnection
    from matlabpipe import MatlabError as error
    from matlabpipe import discover_location, find_available_releases
    from matlabpipe import UnixMatlabReleaseNotFound as MatlabReleaseNotFound

Python rpy2 and matplotlib conflict when using multiprocessing

I am trying to calculate and generate plots using multiprocessing. On Linux the code below runs correctly, however on the Mac (ML) it doesn't, giving the error below:
import multiprocessing
import matplotlib.pyplot as plt
import numpy as np
import rpy2.robjects as robjects
def main():
    pool = multiprocessing.Pool()
    num_figs = 2
    # generate some random numbers
    input = zip(np.random.randint(10,1000,num_figs),
                range(num_figs))
    pool.map(plot, input)
def plot(args):
    num, i = args
    fig = plt.figure()
    data = np.random.randn(num).cumsum()
    plt.plot(data)
main()
The Rpy2 is rpy2==2.3.1 and R is 2.13.2 (I could not install R 3.0 and rpy2 latest version on any mac without getting segmentation fault).
The error is:
The process has forked and you cannot use this CoreFoundation functionality safely. You MUST exec().
Break on __THE_PROCESS_HAS_FORKED_AND_YOU_CANNOT_USE_THIS_COREFOUNDATION_FUNCTIONALITY___YOU_MUST_EXEC__() to debug.
The process has forked and you cannot use this CoreFoundation functionality safely. You MUST exec().
I have tried everything to understand what the problem is with no luck. My configuration is:
Danials-MacBook-Pro:~ danialt$ brew --config
HOMEBREW_VERSION: 0.9.4
ORIGIN: https://github.com/mxcl/homebrew
HEAD: 705b5e133d8334cae66710fac1c14ed8f8713d6b
HOMEBREW_PREFIX: /usr/local
HOMEBREW_CELLAR: /usr/local/Cellar
CPU: dual-core 64-bit penryn
OS X: 10.8.3-x86_64
Xcode: 4.6.2
CLT: 4.6.0.0.1.1365549073
GCC-4.2: build 5666
LLVM-GCC: build 2336
Clang: 4.2 build 425
X11: 2.7.4 => /opt/X11
System Ruby: 1.8.7-358
Perl: /usr/bin/perl
Python: /usr/local/bin/python => /usr/local/Cellar/python/2.7.4/Frameworks/Python.framework/Versions/2.7/bin/python2.7
Ruby: /usr/bin/ruby => /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/bin/ruby
Any ideas?

This error occurs on Mac OS X when you perform a GUI operation outside the main thread, which is exactly what you are doing by shifting your plot function to the multiprocessing.Pool. (I imagine that it will not work on Windows either, since Windows has the same requirement.) The only way that I can imagine it working is to use the pool to generate the data, then have your main thread wait in a loop for the data that's returned (a queue is the way I usually handle it).
Here is an example. It may not do exactly what you want (plot all the figures "simultaneously"?): plt.show() blocks, so only one figure is drawn at a time. I note that your sample code does not call plt.show(), but without it I don't see anything on my screen. If I take it out, there is no blocking and no error, because all GUI functions happen in the main thread.
import multiprocessing
import matplotlib.pyplot as plt
import numpy as np
import rpy2.robjects as robjects
data_queue = multiprocessing.Queue()
def main():
    pool = multiprocessing.Pool()
    num_figs = 10
    # generate some random numbers
    input = zip(np.random.randint(10,10000,num_figs), range(num_figs))
    pool.map(worker, input)
    figs_complete = 0
    while figs_complete < num_figs:
        data = data_queue.get()
        plt.figure()
        plt.plot(data)
        plt.show()
        figs_complete += 1
def worker(args):
    num, i = args
    data = np.random.randn(num).cumsum()
    data_queue.put(data)
    print('done ',i)
main()
Hope this helps.

I had a similar issue with my worker, which was loading some data, generating a plot, and saving it to a file. Note that this is slightly different from the OP's case, which seems to be oriented around interactive plotting. Still, I think it's relevant.
A simplified version of my code:
def worker(id):
    data = load_data(id)
    plot_data_to_file(data) # Generates a plot and saves it to a file.
def plot_something_parallel(ids):
    pool = multiprocessing.Pool()
    pool.map(worker, ids)
plot_something_parallel(ids=[1,2,3])
This caused the same error others mention:
The process has forked and you cannot use this CoreFoundation functionality safely. You MUST exec().
Break on __THE_PROCESS_HAS_FORKED_AND_YOU_CANNOT_USE_THIS_COREFOUNDATION_FUNCTIONALITY___YOU_MUST_EXEC__() to debug.
Following @bbbruce's train of thought, I solved my problem by switching the matplotlib backend from TkAgg to the default. Specifically, I commented out the following line in my matplotlibrc file:
#backend : TkAgg

This might be rpy2-specific.
There are reports of a similar problem with OS X and multiprocessing here and there.
I think that using an initializer that imports the packages needed to run the code in plot could solve the problem (multiprocessing-doc).
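The initializer idea can be sketched as follows. This is an illustration of the mechanism only: json stands in for the heavy packages (matplotlib, rpy2) so the example needs nothing beyond the standard library, and the names _state, init_worker and task are hypothetical. Whether deferring the imports this way actually avoids the CoreFoundation error is not confirmed here.

```python
import multiprocessing

# Per-process storage filled in by the initializer.
_state = {}

def init_worker():
    # Runs once in each worker process, before it accepts any task.
    # Importing here means the parent never loads the package before forking.
    import json  # stand-in for matplotlib.pyplot, rpy2.robjects, ...
    _state["json"] = json

def task(n):
    # Uses the module that the initializer imported in this process.
    return _state["json"].dumps({"n": n})

if __name__ == "__main__":
    with multiprocessing.Pool(processes=2, initializer=init_worker) as pool:
        print(pool.map(task, range(3)))
```

The Pool constructor also accepts an initargs tuple if the initializer needs parameters, such as a backend name.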

I had a similar issue and found that setting the multiprocessing start method to forkserver works, as long as the call comes after your if __name__ == '__main__': statement.
import multiprocessing
# targetOne and targetTwo are the functions your processes should run
if __name__ == '__main__':
    multiprocessing.set_start_method('forkserver')
    first_process = multiprocessing.Process(target=targetOne)
    second_process = multiprocessing.Process(target=targetTwo)
    first_process.start()
    second_process.start()

Try to upgrade matplotlib to 3.0.3:
pip3 install matplotlib --upgrade
Then everything goes fine.
=======================================================================
No need to read below anymore.
Yesterday, my multiprocess plotting worked on my MacBook Air. But this morning the same code did not work on my MacBook Pro, printing many lines like:
The process has forked and you cannot use this CoreFoundation functionality safely. You MUST exec().
Break on __THE_PROCESS_HAS_FORKED_AND_YOU_CANNOT_USE_THIS_COREFOUNDATION_FUNCTIONALITY___YOU_MUST_EXEC__() to debug.
The process has forked and you cannot use this CoreFoundation functionality safely. You MUST exec().
Both machines use 4th-generation Intel Core CPUs (an i5-4xxx in the Air and an i7-4xxx in the Pro). So if there is no difference in hardware, the difference must be in software.
So I tried updating matplotlib to 3.0.3 on the MacBook Pro (it was 3.0.1), and everything works fine.
Also, no need to do pool.apply_async anymore.
