Add custom Python library path to PySpark - python

On my Hadoop cluster, the Anaconda package was installed in a path other than the default Python path. I get the error below when I try to access numpy in PySpark:
ImportError: No module named numpy
I am invoking PySpark using Oozie.
I have tried to pass this custom Python library path using the approaches below.
Using Oozie property tags
<property>
  <name>oozie.launcher.mapreduce.map.env</name>
  <value>PYSPARK_PYTHON=/var/opt/teradata/anaconda2/bin/python2.7</value>
</property>
Using the spark-opts tag
<spark-opts>spark.yarn.appMasterEnv.PYSPARK_PYTHON=/var/opt/teradata/anaconda2/bin/python2.7 --conf spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=/var/opt/teradata/anaconda2/bin/python2.7 --conf spark.pyspark.python=/var/opt/teradata/anaconda2/bin/python2.7 --conf spark.pyspark.driver.python=/var/opt/teradata/anaconda2/bin/python2.7</spark-opts>
Nothing works.
When I run a plain Python script it works fine; the problem only appears when it goes through PySpark.
I even set this in the PySpark script header as
#! /usr/bin/env /var/opt/teradata/anaconda2/bin/python2.7
When I print sys.path in my PySpark code, it still gives me the default paths below:
[ '/usr/lib/python27.zip', '/usr/lib64/python2.7', '/usr/lib64/python2.7/plat-linux2', '/usr/lib64/python2.7/lib-tk', '/usr/lib64/python2.7/lib-old', '/usr/lib64/python2.7/lib-dynload', '/usr/lib64/python2.7/site-packages', '/usr/local/lib64/python2.7/site-packages', '/usr/local/lib/python2.7/site-packages', '/usr/lib/python2.7/site-packages']
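One quick sanity check here is to print which interpreter and module search path the executors actually resolve, from inside a task. A small diagnostic sketch (assuming a SparkContext is available or can be created in the script):

import sys
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

def interpreter_info(_):
    # Runs on an executor: report the Python binary and search path it uses.
    import sys as executor_sys
    return executor_sys.executable, executor_sys.path

# Compare the driver-side view with the executor-side view.
print("driver:", sys.executable)
print("executor:", sc.parallelize([0], 1).map(interpreter_info).collect())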

I'm having this same issue. In my case it seems like ml classes (e.g. Vector) are calling numpy behind the scenes, but are not looking for it in the standard installation places. Even though the Python2 and Python3 versions of numpy are installed on all nodes of the cluster, PySpark is still complaining it can't find it.
I've tried a lot of suggestions that have not worked.
Two things that have been suggested that I haven't tried:
1) Use the bashrc for the user that PySpark runs as (ubuntu for me) to set the preferred python path. Do this on all nodes.
2) Have the PySpark script attempt to install the module in question as part of its functionality (e.g. by shelling out to pip/pip3); a rough sketch of this idea follows below.
I'll keep an eye here and if I find an answer I'll post it.
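As a sketch of suggestion 2) above, the script could shell out to pip against whatever interpreter it is running under. Note this only affects the node the code actually runs on, and numpy here is just the module from the question:

import subprocess
import sys

# Install numpy into the interpreter that is executing this script (driver side only;
# it does not install anything on the other worker nodes by itself).
subprocess.check_call([sys.executable, "-m", "pip", "install", "numpy"])

import numpy  # should import now if the installation succeeded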

Related

Running a pyspark program on python3 kernel in jupyter notebook

I used pip install pyspark to install PySpark. I didn't set any path etc.; however, I found that everything was downloaded and copied into C:/Users/Admin/anaconda3/scripts. I opened jupyter notebook in a Python3 kernel and I tried to run a SystemML script but it was giving me an error. I realized that I needed to place winutils.exe in C:/Users/Admin/anaconda3/scripts as well, so I did that and the script ran as expected.
Now, my program includes GridSearch, and when I run it on my personal laptop it is markedly slower than on a cloud data platform where I can start a kernel with Spark (such as IBM Watson Studio).
So my questions are:
(i) How do I add PySpark to the Python3 kernel? Or is it already working in the background when I import pyspark?
(ii) When I run the same code on the same dataset using pandas and scikit-learn, there is not much difference in performance. When is PySpark preferred/beneficial over pandas and scikit-learn?
Another thing is, even though PySpark seems to be working fine and I'm able to import its libraries, when I try to run
import findspark
findspark.init()
it throws an error (on line 2) saying a list index is out of range. I googled a bit and found advice saying that I had to explicitly set SPARK_HOME='C:/Users/Admin/anaconda3/Scripts'; but when I do that, PySpark stops working (and findspark.init() still fails).
If anyone can explain what is going on, I'd be very grateful. Thank you.
"How do I add PySpark to the Python3 kernel?"
pip install, as you've said you have already done.
"there is not much difference in performance"
You're only using one machine, so there wouldn't be.
"When is PySpark preferred/beneficial over pandas and scikit-learn?"
When you want to deploy the same code onto an actual Spark cluster and your dataset is stored in distributed storage.
You don't necessarily need findspark if your environment variables are already set up.
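If you do keep findspark, the usual pattern is to point it at the root of an actual Spark distribution rather than at anaconda3/Scripts. A sketch along those lines (the path below is only a placeholder for wherever Spark is actually unpacked):

import findspark

# SPARK_HOME must be the Spark root folder (the one containing bin/ and jars/),
# not anaconda3/Scripts. The path here is a placeholder.
findspark.init("C:/spark/spark-2.4.4-bin-hadoop2.7")

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()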

ImportError: No module named site - mac

I have had this problem for months. Every time I want to install a new Python package and use it, I get this error in the terminal:
ImportError: No module named site
I don't know why I get this error. In fact, I can't use any new package, because every time I want to install one I get this error. I searched and found that the problem might be with PYTHONPATH and PYTHONHOME, but I don't know what they are or how to change them.
My operating system is macOS and I don't know how to solve this problem on a Mac.
Every time I open the terminal I use these two commands to work around the problem:
unset PYTHONHOME
unset PYTHONPATH
but the next time I get the error again.
The Python tooling is evolving on Mac OS X, and Python 2.7 is being deprecated. System default paths may no longer apply, as I learned after upgrading to Mac OS X Catalina (10.15.x). There are a few scenarios here:
GENERIC FIX -- According to this primer on the site package and Python internal paths, Python 2.7 requires a specific user path for site packages. The solution I settled on was simply this:
mkdir -p ~/.local/lib/python2.7/site-packages
At that point you will have a dedicated directory for Python modules (for Python to function). (NOTE that the error ImportError: No module named site is misleading as it really indicates that the correct directory structure didn't exist to allow the site module to properly load.)
BASIC ALTERNATIVE -- Is your PYTHONHOME pointing to Python3 and your python --version reporting 2.7? Does python3 --version report something different (or work at all)? This error started happening for me after the Catalina upgrade when I was trying to use built-ins. Check your .profile and .bash_profile to see if they are explicitly setting a custom PYTHONHOME and/or PYTHONPATH. One option is to change that so the ENV variables are set explicitly and manually only when a newer Python is needed. You may consider symlinking python3 and pip3 to leave the standard commands to the OS.
DEBUGGING -- If you would like to test and/or get more information, try executing Python in one of the increasingly verbose modes:
python -v
python -vv
python -vvv
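As an extra check, whichever interpreter does start cleanly can report what it thinks its home and search path are, which makes a stray PYTHONHOME or PYTHONPATH easy to spot. A small diagnostic sketch:

import os
import sys

# Which binary is running, where it thinks the standard library lives,
# and whether the environment is overriding anything.
print("executable:", sys.executable)
print("prefix:    ", sys.prefix)
print("PYTHONHOME:", os.environ.get("PYTHONHOME"))
print("PYTHONPATH:", os.environ.get("PYTHONPATH"))
print("sys.path:  ", sys.path)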

Using a JAR dependency in a PySpark parallelized execution context

This is for a PySpark / Databricks project:
I've written a Scala JAR library and exposed its functions as UDFs via a simple Python wrapper; everything works as it should in my PySpark notebooks. However, when I try to use any of the functions imported from the JAR in an sc.parallelize(..).foreach(..) environment, execution keeps dying with the following error:
TypeError: 'JavaPackage' object is not callable
at this line in the wrapper:
jc = get_spark()._jvm.com.company.package.class.get_udf(function.__name__)
My suspicion is that the JAR library is not available in the parallelized context, since if I replace the library path with some gibberish, the error remains exactly the same.
I haven't been able to find the necessary clues in the Spark docs so far. Using sc.addFile("dbfs:/FileStore/path-to-library.jar") didn't help.
You could try adding the JAR via the PYSPARK_SUBMIT_ARGS environment variable (before Spark 2.3 this was also doable with SPARK_CLASSPATH). For example:
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars <path/to/jar> pyspark-shell'
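One caveat worth adding: PYSPARK_SUBMIT_ARGS is read when the JVM gateway is launched, so in a context where you control session creation it has to be set before the SparkSession/SparkContext exists. A sketch, with the JAR path kept as a placeholder:

import os

# Must be set before any SparkContext/SparkSession is created in this process.
# The JAR path below is a placeholder.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars /dbfs/FileStore/path-to-library.jar pyspark-shell'

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()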

No module named 'resource' installing Apache Spark on Windows

I am trying to install Apache Spark to run locally on my Windows machine. I have followed all the instructions here: https://medium.com/@loldja/installing-apache-spark-pyspark-the-missing-quick-start-guide-for-windows-ad81702ba62d.
After this installation I am able to successfully start pyspark, and execute a command such as
textFile = sc.textFile("README.md")
When I then execute a command that operates on textFile such as
textFile.first()
Spark gives me the error 'worker failed to connect back', and I can see an exception in the console coming from worker.py saying 'ModuleNotFoundError: No module named resource'. Looking at the source file, I can see that this Python file does indeed try to import the resource module; however, this module is not available on Windows systems. I understand that you can install Spark on Windows, so how do I get around this?
I struggled the whole morning with the same problem. Your best bet is to downgrade to Spark 2.3.2.
The fix can be found at https://github.com/apache/spark/pull/23055.
The resource module is only for Unix/Linux systems and is not applicable in a Windows environment. This fix is not yet included in the latest release, but you can modify the worker.py in your installation as shown in the pull request. The changes to that file can be found at https://github.com/apache/spark/pull/23055/files.
You will have to re-zip the pyspark directory and move it to the lib folder in your PySpark installation directory (where you extracted the pre-compiled PySpark according to the tutorial you mentioned).
Adding to all those valuable answers:
For Windows users, make sure you have copied the correct version of the winutils.exe file (for your specific version of Hadoop) to the spark/bin folder.
Say, if you have Hadoop 2.7.1, then you should copy the winutils.exe file from the Hadoop 2.7.1/bin folder.
The link for that is here: https://github.com/steveloughran/winutils
I edited the worker.py file and removed all the resource-related lines, namely the '# set up memory limits' block and the import resource statement. The error disappeared.
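For reference, the upstream change is essentially a guard around that import; a paraphrased sketch of the idea (not the exact diff from the pull request):

has_resource_module = True
try:
    import resource  # Unix-only module, unavailable on Windows
except ImportError:
    has_resource_module = False

# ...the "set up memory limits" block then only runs when has_resource_module is True.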

ImportError: No module named caffe while running spark-submit

While running spark-submit on a Spark standalone cluster consisting of one master and one worker, the caffe Python module does not get imported, and the job fails with the error ImportError: No module named caffe
This doesn't seem to be an issue whenever I run a job locally with
spark-submit --master local script.py
In that case the caffe module gets imported just fine.
The environment variables for Spark and Caffe are currently set under ~/.profile, and they are added to the PYTHONPATH.
Is ~/.profile the correct location to set these variables, or is a system-wide configuration needed, such as adding the variables under /etc/profile.d/?
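One general note on the environment-variable part of the question: values set in ~/.profile are not necessarily visible to the executor processes, and Spark can pass environment variables to executors explicitly via spark.executorEnv.*. A sketch, with a placeholder Caffe path:

from pyspark import SparkConf, SparkContext

# Propagate the (placeholder) Caffe python path to the executor processes explicitly,
# instead of relying on ~/.profile being sourced on every worker.
conf = SparkConf().set("spark.executorEnv.PYTHONPATH", "/path/to/caffe/python")
sc = SparkContext(conf=conf)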
Please note that the CaffeOnSpark team ported Caffe to a distributed environment backed by Hadoop and Spark. You cannot, I am 99.99% sure, use Caffe alone (without any modifications) in a Spark cluster or any distributed environment per se. (Caffe team is known to be working on this).
If you need distributed deep learning with Caffe, please follow the build instructions at https://github.com/yahoo/CaffeOnSpark/wiki/build to build CaffeOnSpark, and use CaffeOnSpark instead of Caffe.
But your best bet will be to follow either the GetStarted_standalone wiki or the GetStarted_yarn wiki to set up a distributed environment for deep learning.
Further, to add Python support, please go through the GetStarted_python wiki.
Also, since you mentioned that you are using Ubuntu, please use ~/.bashrc to update your environment variables. You will have to source the file after the changes: source ~/.bashrc
