ImportError: No module named caffe while running spark-submit - python

While running spark-submit on a Spark standalone cluster comprised of one master and one worker, the caffe Python module does not get imported and the job fails with ImportError: No module named caffe.
This doesn't seem to be an issue when I run a job locally with
spark-submit --master local script.py; the caffe module gets imported just fine.
The environment variables for Spark and Caffe are currently set in ~/.profile and are added to PYTHONPATH.
Is ~/.profile the correct location to set these variables, or is a system-wide configuration needed, such as adding the variables under /etc/profile.d/?

Please note that the CaffeOnSpark team ported Caffe to a distributed environment backed by Hadoop and Spark. You cannot, I am 99.99% sure, use Caffe alone (without any modifications) in a Spark cluster or any distributed environment per se. (The Caffe team is known to be working on this.)
If you need distributed deep learning with Caffe, follow the build instructions at https://github.com/yahoo/CaffeOnSpark/wiki/build to build CaffeOnSpark, and use CaffeOnSpark instead of Caffe.
Your best bet will be to follow either the GetStarted_standalone wiki or the GetStarted_yarn wiki to create a distributed environment for deep learning.
Further, to add Python, go through the GetStarted_python wiki.
Also, since you mentioned that you are using Ubuntu, use ~/.bashrc to update your environment variables. You will have to source the file after the changes: source ~/.bashrc
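To confirm whether the problem is on the driver or on the workers, it can help to run a tiny job that attempts the import on the executors and reports what PYTHONPATH they actually see. This is only a diagnostic sketch; the master URL and script name are placeholders:
import os
from pyspark import SparkContext

# submit with, e.g.: spark-submit --master spark://<master-host>:7077 check_caffe.py
sc = SparkContext(appName="check-caffe-import")

def probe(_):
    import importlib, os, socket
    try:
        importlib.import_module("caffe")
        ok = True
    except ImportError:
        ok = False
    return (socket.gethostname(), ok, os.environ.get("PYTHONPATH"))

# runs on the executors and reports (hostname, caffe importable?, PYTHONPATH)
print(sc.parallelize(range(2), 2).map(probe).collect())
print("driver PYTHONPATH: " + str(os.environ.get("PYTHONPATH")))
sc.stop()
If the import succeeds locally but fails here, the worker processes are not picking up the variables set in your shell profile.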

Related

Can I zip PySpark dependencies containing some setuptools.Extension?

I am attempting to include the dateparser package in a PySpark (v2.4.3) shell session via a short zip build process, pip install -r requirements.txt -t some_target && cd some_target && zip -r ../deps.zip . && cd .., after which I would, for example, run pyspark --py-files deps.zip. When importing dateparser, though, I get an indirect ModuleNotFoundError from the regex library, whining that "No module named 'regex._regex'" (the stack trace says this is referenced in /mnt/tmp/spark-some/long/path/deps.zip/regex/_regex_core.py line 21, which is of course referenced much farther up the stack by dateparser).
I attempted adding a flag to the dateparser line in requirements.txt, like dateparser --no-binary=regex, but the error persisted. A normal Python shell is able to import it without issue, and other packages in this zip seem to be importable in the PySpark shell without issue.
This has led me down a number of rabbit holes, but I think/hope I have finally found the culprit: namely, that regex._regex is not a normal .py file but rather a .so. My knowledge of the Python build process is limited, but it seems that the regex library's setup.py uses the setuptools.Extension class to compile some C files into this shared object. I have seen suggestions to modify the LD_LIBRARY_PATH environment variable to make those shared objects discoverable to Python, but a number of comments also suggested this was dangerous and not a viable long-term solution. The fact that a normal interactive Python session has no issue with the import also has me skeptical, since the LD_LIBRARY_PATH variable doesn't even exist in os.environ within that interactive shell. I'm therefore left wondering whether --py-files is insufficient for including packages that compile these Extension objects (which seems unlikely, since plenty of people do crazier things than my simple use case), or whether this actually stems from neglecting some other setting.
A thousand thanks for any and all help :)
The error appears to stem from the import machinery not being able to load binary (.so) files from within a zip archive, i.e., the deps.zip that I pass with the --py-files parameter. I first tried pulling out the regex dependency and building a .whl to include in --py-files, only to discover that my version of PySpark (v2.4.3) predates wheel support. I was, however, able to build an .egg from the source code and then set the PYTHON_EGG_CACHE and PYTHON_EGG_DIR environment variables for spark.executorEnv and spark.driverEnv. I'm not sure the last step would be necessary for others; it seems to have stemmed from odd permissions issues that may only apply to my user/group/use case.
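A rough sketch of what that configuration can look like in code is below. The file names and cache directory are placeholders, and the egg-cache settings follow the answer above rather than anything required by Spark itself:
import os
from pyspark import SparkConf, SparkContext

# driver-side egg cache (assumption: only needed if the driver also unpacks the egg)
os.environ.setdefault("PYTHON_EGG_CACHE", "/tmp/python-eggs")

conf = (SparkConf()
        .set("spark.executorEnv.PYTHON_EGG_CACHE", "/tmp/python-eggs")
        .set("spark.executorEnv.PYTHON_EGG_DIR", "/tmp/python-eggs"))

sc = SparkContext(conf=conf)
sc.addPyFile("deps.zip")   # zipped pure-Python dependencies (placeholder path)
sc.addPyFile("regex.egg")  # egg built from the regex source tree (placeholder path)

import dateparser          # regex._regex should now resolve from the unpacked egg
print(dateparser.parse("2019-06-01"))
The same files can instead be passed on the command line with --py-files, and the environment variables with --conf spark.executorEnv.PYTHON_EGG_CACHE=..., whichever fits your launch setup.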

ImportError: No module named site - mac

I have had this problem for months. Every time I want to install a new Python package and use it, I get this error in the terminal:
ImportError: No module named site
I don't know why I get this error. Actually, I can't use any new package, because every time I want to install one I get this error. I searched and found out that the problem is probably related to PYTHONPATH and PYTHONHOME, but I don't know what they are or how to change them.
My operating system is macOS, and I don't know how to solve this problem on a Mac.
Every time I open the terminal, I use these two commands to work around the problem:
unset PYTHONHOME
unset PYTHONPATH
but the next time I get the error again.
The Python tooling is evolving in Mac OS X such that Python 2.7 is being deprecated. System default paths may no longer apply, as I learned after upgrading to Mac OS X Catalina (10.15.x). There are a few scenarios here:
GENERIC FIX -- According to this primer on the site package and Python internal paths, Python 2.7 requires a specific user path for site packages. The solution I settled on was simply this:
mkdir -p ~/.local/lib/python2.7/site-packages
At that point you will have a dedicated directory for Python modules (which Python needs to function). (NOTE that the error ImportError: No module named site is misleading, as it really indicates that the correct directory structure didn't exist for the site module to load properly.)
BASIC ALTERNATIVE -- Is your PYTHONHOME pointing to Python 3 while python --version reports 2.7? Does python3 --version report something different (or work at all)? This error started happening for me after the Catalina upgrade when I was trying to use built-ins. Check your .profile and .bash_profile to see if they explicitly set a custom PYTHONHOME and/or PYTHONPATH. One option is to change that so the ENV variables are set explicitly and manually only when a newer Python is needed. You may consider symlinking python3 and pip3 to leave the standard commands to the OS.
DEBUGGING -- If you would like to test and/or get more information, try executing Python in one of the increasingly verbose modes (there is also a short path-inspection sketch after these commands):
python -v
python -vv
python -vvv
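Once the interpreter starts at all (for example after the unset workaround above), a few lines of Python will show exactly which paths it resolved, which helps confirm whether PYTHONHOME and PYTHONPATH point somewhere sensible. A minimal sketch:
import os
import site
import sys

print("executable: " + sys.executable)
print("prefix:     " + sys.prefix)
print("PYTHONHOME: " + str(os.environ.get("PYTHONHOME")))
print("PYTHONPATH: " + str(os.environ.get("PYTHONPATH")))
print("user site:  " + site.getusersitepackages())
for p in sys.path:
    print("sys.path:   " + p)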

No module named 'resource' installing Apache Spark on Windows

I am trying to install Apache Spark to run locally on my Windows machine. I have followed all the instructions here: https://medium.com/@loldja/installing-apache-spark-pyspark-the-missing-quick-start-guide-for-windows-ad81702ba62d.
After this installation I am able to successfully start pyspark, and execute a command such as
textFile = sc.textFile("README.md")
When I then execute a command that operates on textFile such as
textFile.first()
Spark gives me the error 'worker failed to connect back', and I can see an exception in the console coming from worker.py saying 'ModuleNotFoundError: No module named resource'. Looking at the source file, I can see that this Python file does indeed try to import the resource module, but that module is not available on Windows systems. I understand that you can install Spark on Windows, so how do I get around this?
I struggled the whole morning with the same problem. Your best bet is to downgrade to Spark 2.3.2.
The fix can be found at https://github.com/apache/spark/pull/23055.
The resource module is only for Unix/Linux systems and is not applicable in a Windows environment. This fix is not yet included in the latest release, but you can modify the worker.py in your installation as shown in the pull request. The changes to that file can be found at https://github.com/apache/spark/pull/23055/files.
You will have to re-zip the pyspark directory and move it to the lib folder in your pyspark installation directory (where you extracted the pre-compiled pyspark according to the tutorial you mentioned).
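The change in that pull request essentially stops worker.py from assuming the resource module exists. A minimal sketch of that shape (not a verbatim copy of the patch):
has_resource_module = True
try:
    import resource
except ImportError:
    # the resource module is Unix-only; on Windows, skip the memory-limit handling
    has_resource_module = False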
Adding to all those valuable answers:
For Windows users, make sure you have copied the correct version of the winutils.exe file (for your specific version of Hadoop) to the spark/bin folder.
Say,
if you have Hadoop 2.7.1, then you should copy the winutils.exe file from the Hadoop 2.7.1/bin folder.
The link for that is here:
https://github.com/steveloughran/winutils
I edited the worker.py file and removed all resource-related lines, namely the import resource statement and the # set up memory limits block. The error disappeared.

Adding python modules to AzureML workspace

I've been working recently on deploying a machine learning model as a web service. I used Azure Machine Learning Studio to create my own workspace ID and authorization token. Then I trained a LogisticRegressionCV model from sklearn.linear_model locally on my machine (using Python 2.7.13), and with the code snippet below I wanted to publish my model as a web service:
import numpy as np
from azureml import services

@services.publish('workspaceID', 'authorization_token')
@services.types(var_1=float, var_2=float)
@services.returns(int)
def predicting(var_1, var_2):
    # reshape into a single-row 2D array for predict_proba
    input = np.array([var_1, var_2]).reshape(1, -1)
    return model.predict_proba(input)[0][1]
where the input variable holds the data to be scored and the model variable contains the trained classifier. After defining the above function, I want to make a prediction on a sample input vector:
predicting.service(1.21, 1.34)
However, the following error occurs:
RuntimeError: Error 0085: The following error occurred during script
evaluation, please view the output log for more information:
And the most important message in log is:
AttributeError: 'module' object has no attribute 'LogisticRegressionCV'
The error is strange to me, because when I was using the plain sklearn.linear_model.LogisticRegression everything was fine. I was able to make predictions by sending POST requests to the created endpoint, so I guess sklearn worked correctly.
After changing to LogisticRegressionCV, it does not.
Therefore I wanted to update sklearn in my workspace.
Do you have any ideas how to do it? Or, a more general question: how do I install any Python module on Azure Machine Learning Studio in a way that lets me use the predict functions of any model I developed locally?
Thanks
For anyone who came across this question like I did, hoping to install modules in AzureML notebooks: it seems the current environments sit on Conda on the compute, so it's now as simple as executing
!conda env list
# conda environments:
#
base * /anaconda
azureml_py36 /anaconda/envs/azureml_py36
!conda install -n azureml_py36 -y <packages>
from within the notebook environment, or doing pretty much the same without the ! in the terminal environment.
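If you would rather use pip, a pattern that usually works from a notebook cell (assuming pip is present in that conda environment, which is not guaranteed) is to install against the interpreter the kernel is actually running:
import sys
!{sys.executable} -m pip install <packages>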
For installing a Python module on Azure ML Studio, there is a section, Technical Notes, of the official document Execute Python Script which introduces it.
The general steps are as below.
Create a Python project via virtualenv and activate it.
Install all the packages you want via pip in that virtual Python environment.
Package all files and directories under the path Lib\site-packages of your project as a zip file (see the packaging sketch just after these steps).
Upload the zip package into your Azure ML workspace as a dataset.
Follow the official document to import the Python module for your Execute Python Script.
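For the packaging step, a small script run from the virtualenv can produce the zip. This is only a sketch; the virtualenv path is a placeholder and the Lib\site-packages layout assumes a Windows virtualenv:
import os
import zipfile

venv_root = r"C:\path\to\your\virtualenv"  # placeholder: your virtualenv location
site_packages = os.path.join(venv_root, "Lib", "site-packages")

with zipfile.ZipFile("python_modules.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for folder, _, files in os.walk(site_packages):
        for name in files:
            full_path = os.path.join(folder, name)
            # store paths relative to site-packages so imports resolve from the zip root
            zf.write(full_path, os.path.relpath(full_path, site_packages))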
For more details, you can refer to the similar SO thread Updating pandas to version 0.19 in Azure ML Studio; it even introduces how to update the version of the Python packages installed by Azure.
Hope it helps.
I struggled with the same issue: error 0085.
I was able to resolve it by using an Azure ML code example available from their library:
Deployment of AzureML Web Services from Python Notebooks,
which can be found at https://gallery.cortanaintelligence.com/Notebook/Deployment-of-AzureML-Web-Services-from-Python-Notebooks-4
I won't copy the whole code here, but I used it exactly as is and it worked with the Boston dataset. Then I used it with my own dataset, and I no longer got error 0085. I haven't tracked down the error yet, but it's most likely due to some misbehaving character or indentation. Hope this helps.

What other ways can python look for modules?

We have a piece of software called ArcGIS that comes with a Python environment, which has a library called arcpy.
When you execute the python.exe from that environment, it imports arcpy with no issue.
But I needed to create another Python environment that contains the same library as this one, and I just couldn't find anything named arcpy in the environment's folders.
I even copied the whole Lib folder from the original environment into the one I'm trying to create, but it still won't import arcpy.
I know this is kind of a shot in the dark, as it is a proprietary library and I can't share much info, but does anyone know what it could be?
It seems they use Anaconda too.
The Python (arcpy) install that ships with ArcGIS typically installs to:
C:\Python27\ArcGIS10.5
Arcpy does not like to be moved, and the library is linked directly with your ArcGIS installation:
C:\Program Files (x86)\ArcGIS\Desktop10.5
If you are using ArcGIS Pro rather than Desktop, it installs into a Conda environment:
C:\Program Files\ArcGIS\Pro\bin\Python\envs\arcgispro-py3\
This Q&A on GIS Stack Exchange may be of some interest to you - How to set up Python/ArcPy with ArcGIS Pro 1.3
Go inside the environment that has arcpy, look for the environment variable PYTHONPATH, and just add that path to PYTHONPATH in your new environment.
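To see exactly where the working environment is finding arcpy (and to replicate that in the new one), something along these lines can help. The .pth suggestion at the end is a sketch that assumes the new environment has a standard site-packages directory:
import sys

try:
    import arcpy
    print("arcpy loaded from: " + getattr(arcpy, "__file__", "<unknown>"))
except ImportError:
    print("arcpy not found; sys.path is:")
    for p in sys.path:
        print("  " + p)

# In the new environment, a .pth file dropped into its site-packages
# (e.g. arcpy.pth listing the directories printed above, one per line)
# makes Python search those locations at startup without copying any files.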
