I am running Spark programs on a large cluster (for which, I do not have administrative privileges). numpy is not installed on the worker nodes. Hence, I bundled numpy with my program, but I get the following error:
Traceback (most recent call last):
File "/home/user/spark-script.py", line 12, in <module>
import numpy
File "/usr/local/lib/python2.7/dist-packages/numpy/__init__.py", line 170, in <module>
File "/usr/local/lib/python2.7/dist-packages/numpy/add_newdocs.py", line 13, in <module>
File "/usr/local/lib/python2.7/dist-packages/numpy/lib/__init__.py", line 8, in <module>
File "/usr/local/lib/python2.7/dist-packages/numpy/lib/type_check.py", line 11, in <module>
File "/usr/local/lib/python2.7/dist-packages/numpy/core/__init__.py", line 6, in <module>
ImportError: cannot import name multiarray
The script is actually quite simple:
from pyspark import SparkConf, SparkContext
sc = SparkContext()
sc.addPyFile('numpy.zip')
import numpy
a = sc.parallelize(numpy.array([12, 23, 34, 45, 56, 67, 78, 89, 90]))
print a.collect()
I understand that the error occurs because numpy dynamically loads multiarray.so dependency and even if my numpy.zip file includes multiarray.so file, somehow the dynamic loading doesn't work with Apache Spark. Why so? And how do you othewise create a standalone numpy module with static linking?
Thanks.
There are at least two problems with your approach and both can be reduced to a simple fact that NumPy is a heavyweight dependency.
First of all Debian packages come with multiple dependencies including libgfortran, libblas, liblapack and libquadmath. So you cannot simply copy NumPy installation and expect that things will work (to be honest you shouldn't do anything like this if it wasn't the case). Theoretically you could try to build it using static linking and this way ship it with all the dependencies but it hits the second issue.
NumPy is pretty large by itself. While 20MB doesn't look particularly impressive and with all the dependencies it shouldn't be more 40MB it has to be shipped to the workers each time you start your job. The more workers you have the worse it gets. If you decide you need SciPy or SciKit it can get much worse.
Arguably this makes NumPy a really bad candidate for being shipped with pyFile method.
If you hadn't have direct access to the workers but all the dependencies, including header files and a static library were present, you could simply try to install NumPy in the user space from the task itself (it assumes that pip is installed as well) with something like this:
try:
import numpy as np
expect ImportError:
import pip
pip.main(["install", "--user", "numpy"])
import numpy as np
You'll find other variants of this method in How to install and import Python modules at runtime?
Since you have access to the workers a much better solution is to create a separate Python environment. Probably the simplest approach is to use Anaconda which can be used to package non-Python dependencies as well and doesn't depend on the system-wide libraries. You can easily automate this task using tools like Ansible or Fabric, it doesn't require administrative privileges and all you really need is bash and some way to fetch basic installers (wget, curl, rsync, scp).
See also: shipping python modules in pyspark to other nodes?
Related
I'm attempting to figure out the mechanics of running Python scripts in Power BI for Reasons and I've hit a snag. I was running through the steps in this somewhat basic tutorial and I came to the section in which I am supposed to paste the script into they Python Script screen, which is Step 3 of the 'Run the Script and Import Data' section.
When I followed the steps, which are essentially to paste the example script into the window and hit Okay, I got this 'helpful' error:
`
Details: "ADO.NET: Python script error.
<pi>Traceback (most recent call last):
File "C:\Users\my.username\PythonScriptWrapper_aca634c7-30c3-4bf4-881b-d1e47bb0a919\PythonScriptWrapper.PY", line 2, in <module>
import os, pandas, matplotlib
File "C:\Users\my.username\Anaconda3\lib\site-packages\matplotlib\__init__.py", line 109, in <module>
from . import _api, _version, cbook, docstring, rcsetup
File "C:\Users\my.username\Anaconda3\lib\site-packages\matplotlib\rcsetup.py", line 27, in <module>
from matplotlib.colors import Colormap, is_color_like
File "C:\Users\my.username\Anaconda3\lib\site-packages\matplotlib\colors.py", line 51, in <module>
from PIL import Image
File "C:\Users\my.username\Anaconda3\lib\site-packages\PIL\Image.py", line 89, in <module>
from . import _imaging as core
ImportError: DLL load failed while importing _imaging: The specified module could not be found.
</pi>"
`
I have verified that the matplotlib and pandas packages were installed via pip list, os doesn't show up, which surprised me but I know it's part of the standard library so I'm not stressing about it unless someone thinks I should. Is anyone an expert in this? Is there a better way? Am I doomed to scream into the Power BI void for all eternity?
The somewhat basic tutorial is based on using standard python from python.org. However, you are using the Anaconda distribution, which basically requires the environment to be activated before any modules - especially pandas' C-library - can be accessed.
You can achieve that in the cmd shell by running
conda activate
C:\Users\my.username\AppData\Local\Microsoft\WindowsApps\PBIDesktopStore.exe
which assumes that you are using the Power BI Desktop version from the Microsoft store.
I have two libraries: Pandas and utils (my library), and I want to import in my code. Since I was testing Pandas does not work as well.
Using boto3 and requests (without being preinstalled in the cluster) it works creating two zip files:
libs.zip: with boto3 and requests
dependencies.zip: utils
So, I import Pandas using a requirements file and creating a zip with all Pandas dependencies. I've tried importing the zip file within the code, like:
sc.addPyFile("libs.zip")
and the spark submit is like:
spark-submit --deploy-mode client --py-files s3://${BUCKET_NAME}/libs.zip s3://${BUCKET_NAME}/main.py
I tried a lot to submit a spark job in EMR cluster and I don't have any idea about this issue:
Traceback (most recent call last):
File "/mnt/tmp/spark-xxxx/main.py", line 20, in <module>
import pandas as pd
File "/mnt/tmp/spark-xxxx/userFiles-xxxx/libs.zip/pandas/__init__.py", line 17, in <module>
ImportError: Unable to import required dependencies:
numpy:
IMPORTANT: PLEASE READ THIS FOR ADVICE ON HOW TO SOLVE THIS ISSUE!
Importing the numpy C-extensions failed. This error can happen for
many reasons, often due to issues with your setup or how NumPy was
installed.
We have compiled some common reasons and troubleshooting tips at:
https://numpy.org/devdocs/user/troubleshooting-importerror.html
Please note and check the following:
* The Python version is: Python3.7 from "/usr/bin/python3"
* The NumPy version is: "1.19.4"
and make sure that they are the versions you expect.
Please carefully study the documentation linked above for further help.
Original error was: No module named 'numpy.core._multiarray_umath'
How can I import Pandas and another library (created by me) in spark submit.
I am trying to run a script which involves numpy through the terminal on WinSCP, but whenever I do, I get the following error:
import gensim
File "/data/work/worker/gensim/init.py", line 5, in
from gensim import parsing, corpora, matutils, interfaces, models, similarities, summarization, utils # noqa:F401
File "/data/work/worker/gensim/parsing/init.py", line 4, in
from .preprocessing import (remove_stopwords, strip_punctuation, strip_punctuation2, # noqa:F401
File "/data/work/worker/gensim/parsing/preprocessing.py", line 42, in
from gensim import utils
File "/data/work/worker/gensim/utils.py", line 38, in
import numpy as np
File "/data/work/worker/numpy/init.py", line 142, in
from . import core
File "/data/work/worker/numpy/core/init.py", line 50, in
raise ImportError(msg)
ImportError:
IMPORTANT: PLEASE READ THIS FOR ADVICE ON HOW TO SOLVE THIS ISSUE!
Importing the numpy C-extensions failed. This error can happen for
many reasons, often due to issues with your setup or how NumPy was
installed.
We have compiled some common reasons and troubleshooting tips at:
https://numpy.org/devdocs/user/troubleshooting-importerror.html
Please note and check the following:
The Python version is: Python2.7 from "/usr/bin/python"
The NumPy version is: "1.18.5"
and make sure that they are the versions you expect.
Please carefully study the documentation linked above for further help.
Original error was: No module named _multiarray_umath>
I suspect that the issue is that the version of Python being quoted here is outdated; however, after having done some searching, I cannot find any literature on how to update Python on WinSCP. I have Python 3.8 installed on my machine, and I have tried moving the installer and the .exe file into the WinSCP directory to no avail. Is there any way to update python directly in the terminal? Alternately, is this issue actually nothing to do with a stale version of Python at all?
I've managed to construct a simple app utilizing the Pico framework (https://github.com/fergalwalsh/pico). My frontend is connecting to my backend without any difficulties. Below is my Python file, which at the moment simply returns/renders a string, using a client-side input value, "name".
from __future__ import absolute_import
import sys
import pico
import numpy as np
# import sklearn
# import pandas as pd
from api2 import aloha
from pico import PicoApp
#pico.expose()
def hello(name):
a = np.arange(15).reshape(3, 5)
# a = np.arrange('data', 'field').reshape(3,5)
return "hello %s, %s" %(name, a)
app = PicoApp()
app.register_module(__name__)
(It also returns a NumPy array, simply because I'm testing what I can import into the file.)
All my packages are installed just fine, via Anaconda in /site-packages, which is in the python3.6 directory.
Oddly, the app runs fine; it can import NumPy. It breaks, however, when I try to import Pandas or SKLearn. I've tried manually copying and pasting NumPy into /Library/Python/2.7/site-packages, which actually breaks the app. But NumPy works in the app when it is only located in Anaconda's /site-packages.
I've tried altering app.register(__name__) to app.register('api'), which is the name of the Python file (api.py), based on another Question/Answer here. I've also tried reinstalling Pandas with sudo -H pip install pandas, but all the requirements are already satisfied.
This is the error that is thrown when I try to include Pandas in api.py:
Traceback (most recent call last):
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 162, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/Library/Python/2.7/site-packages/pico/server.py", line 31, in <module>
app = import_string(module_name)
File "/Library/Python/2.7/site-packages/werkzeug/utils.py", line 443, in import_string
sys.exc_info()[2])
File "/Library/Python/2.7/site-packages/werkzeug/utils.py", line 431, in import_string
module = import_string(module_name)
File "/Library/Python/2.7/site-packages/werkzeug/utils.py", line 443, in import_string
sys.exc_info()[2])
File "/Library/Python/2.7/site-packages/werkzeug/utils.py", line 418, in import_string
__import__(import_name)
File "./api.py", line 6, in <module>
import pandas as pd
File "/Library/Python/2.7/site-packages/pandas/__init__.py", line 23, in <module>
from pandas.compat.numpy import *
File "/Library/Python/2.7/site-packages/pandas/compat/numpy/__init__.py", line 24, in <module>
'this pandas version'.format(_np_version))
werkzeug.utils.ImportStringError: import_string() failed for 'api.app'. Possible reasons are:
- missing __init__.py in a package;
- package or module path not included in sys.path;
- duplicated package or module name taking precedence in sys.path;
- missing module, class, function or variable;
Debugged import:
- 'api' not found.
Original exception:
ImportStringError: import_string() failed for 'api'. Possible reasons are:
- missing __init__.py in a package;
- package or module path not included in sys.path;
- duplicated package or module name taking precedence in sys.path;
- missing module, class, function or variable;
Debugged import:
- 'api' not found.
Original exception:
ImportError: this version of pandas is incompatible with numpy < 1.9.0
your numpy version is 1.8.0rc1.
Please upgrade numpy to >= 1.9.0 to use this pandas version
When I run which python, it points to /Users/richardscheiwe/anaconda3/bin/python. Also, I have NumPy v.1.15 installed, and I can't find any other NumPy folder(s). When I try moving a version of NumPy to Library/Python/2.7/site-packages, I get this error:
ImportError:
Importing the multiarray numpy extension module failed. Most
likely you are trying to import a failed build of numpy.
If you're working with a numpy git repo, try `git clean -xdf` (removes all
files not under version control). Otherwise reinstall numpy.
Original error was: cannot import name multiarray
I guess I need to somehow point the app's Python to Anaconda's Python 3.6 version, but I don't know how to do that. Pico is also available in Anaconda's /site-packages directory, but it isn't pointing there.
Any help is greatly appreciated. I've scoured StackOverflow and GitHub.
You don't mention how you are starting the pico app but I assume you are doing like this:
python -m pico.server api
In this case it will simply use whatever python is in your path. If it is python3 in /Users/richardscheiwe/anaconda3/bin/python but you are getting errors referring to /Library/Python/2.7/ then there is some problem with your anaconda installation/paths in your environment.
There is nothing different with pico to running a plain python script but I suggest you create a simplified script without pico (literally just import pandas) to work out your environment issues with simpler error messages.
If I'm reading this correctly the error seems to come from trying to use a version of NumPy built for running on python 2.6 while your app is running using Python3.
Try removing NumPy using; "sudo pip uninstall numpy" and then use "pip -H install Numpy" to try reinstalling it and seeing if it correctly finds the Python3 version of Numpy
Pypy has a fork of numpy. How to use it in the pypy sandbox?
The sandbox prompt 'No module named numpy'.
However, when I tried to copy the numpy library in the lib_pypy directory, I received this error message. Is there a way to import the numpy in pypy sandbox?
import numpyTraceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/bin/lib_pypy/numpy/__init__.py", line 160, in <module>
raise ImportError(msg)
ImportError: Error importing numpy: you should not try to import numpy from
its source directory; please exit the numpy source tree, and relaunch
your python interpreter from there.
You can't import numpy in a sandboxed PyPy at all, sorry.
This is the case for almost all extension modules. Sandboxing in PyPy is a good proof of concept, really safe imho, but it is only a proof of concept. It seriously requires someone working on it. Before this occurs, it can only be used it for "toy" examples: programs that import mostly nothing (or only pure Python modules, recursively).