I am having an issue using pyspark in an Apache Zeppelin (version 0.6.0) notebook. Running the following simple code gives me a "pyspark interpreter not found" error:
%pyspark
a = 1+3
Running sc.version gave me res2: String = 1.6.0, which is the version of Spark installed on my machine, and running z returned res0: org.apache.zeppelin.spark.ZeppelinContext = {}.
Pyspark works from the CLI (using Spark 1.6.0 and Python 2.6.6).
The default Python on the machine is 2.6.6, while Anaconda Python 3.5 is also installed but not set as the default.
Based on this post I updated the zeppelin-env.sh file located at /usr/hdp/current/zeppelin-server/lib/conf and added the Anaconda Python 3 path:
export PYSPARK_PYTHON=/opt/anaconda3/bin/python
export PYTHONPATH=/opt/anaconda3/bin/python
After that I stopped and restarted Zeppelin several times using
/usr/hdp/current/zeppelin-server/lib/bin/zeppelin-daemon.sh
But I can't get the pyspark interpreter to work in zeppelin.
To anyone who finds that pyspark is not responding: try restarting your Spark interpreter in Zeppelin; it may resolve the "pyspark is not responding" error.
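Once the interpreter has restarted, a quick sanity check in a notebook paragraph can confirm that the %pyspark interpreter responds and is picking up the Python you configured; this is just a minimal sketch, not part of the original answer:

%pyspark
import sys
# Confirms the interpreter responds and shows which Python binary it runs on,
# e.g. the Anaconda path set via PYSPARK_PYTHON in zeppelin-env.sh.
print(sys.version)
print(sys.executable)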
I am starting to learn Spark but am stuck at the first step.
I downloaded Spark from the Apache website and have finished the configuration. Now if I run the pyspark command in my WSL, a Jupyter server starts, I can open it in my Windows browser, and import pyspark works just fine. But if I connect to WSL with VS Code and create a new notebook there, the pyspark module can't be found.
I didn't install the pyspark module through pip or conda because I thought it was already included in the full distribution I downloaded, so installing it again seemed redundant.
Is there any way I can use the remotely installed Spark in VS Code without installing it again separately?
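One common workaround for this kind of setup (not taken from the question, and assuming the small findspark helper package is acceptable to install) is to point the notebook's plain Python kernel at the existing Spark download; the /opt/spark path below is an assumption and should be replaced with wherever the tarball was extracted:

# Hypothetical sketch: expose the downloaded Spark to a plain Python kernel in VS Code.
import findspark
findspark.init("/opt/spark")   # adds Spark's python/ and py4j libraries to sys.path

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("vscode-test").getOrCreate()
print(spark.version)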
I am working on an Ubuntu server which has Spark installed on it.
I don't have sudo access to this server.
So under my directory, I created a new virtual environment where I installed pyspark.
When I type the command below
whereis spark-shell #see below
/opt/spark-2.4.4-bin-hadoop2.7/bin/spark-shell2.cmd /opt/spark-2.4.4-bin-hadoop2.7/bin/spark-shell.cmd /opt/spark-2.4.4-bin-hadoop2.7/bin/spark-shell /home/abcd/.pyenv/shims/spark-shell2.cmd /home/abcd/.pyenv/shims/spark-shell.cmd /home/abcd/.pyenv/shims/spark-shell
Another command:
echo 'sc.getConf.get("spark.home")' | spark-shell
scala> sc.getConf.get("spark.home")
res0: String = /home/abcd/.pyenv/versions/bio/lib/python3.7/site-packages/pyspark
q1) Am I using the right commands to find the installation directory of spark?
q2) Can you help me understand why I see 3 /opt paths and 3 pyenv paths?
A spark installation (like the one you have in /opt/spark-2.4.4-bin-hadoop2.7) typically comes with a pyspark installation within it. You can check this by downloading and extracting this tarball (https://www.apache.org/dyn/closer.lua/spark/spark-2.4.6/spark-2.4.6-bin-hadoop2.7.tgz).
If you install pyspark in a virtual environment, you're installing another instance of pyspark. It comes without the Scala source code, but it does include the compiled Spark code as jars (see the jars folder in your pyspark installation); pyspark is a wrapper over Spark, which is written in Scala. This is probably what you're seeing in /home/abcd/.pyenv/shims/.
The scripts spark-shell2.cmd and spark-shell.cmd in the same directory are part of the same spark installation. These are text files and you can cat them. You will see that spark-shell.cmd calls spark-shell2.cmd within it. You will probably have a lot more scripts in your /opt/spark-2.4.4-bin-hadoop2.7/bin/ folder, all of which are a part of the same spark installation. Same goes for the folder /home/abcd/.pyenv/shims/. Finally, /home/abcd/.pyenv/versions/bio/lib/python3.7/site-packages/pyspark seems like yet another pyspark installation.
It shouldn't matter which pyspark installation you use. In order to use Spark, a Java process needs to be created that runs the Scala/Java code (from the jars in your installation).
Typically, when you run a command like this
# Python code
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('myappname').getOrCreate()
then you create a new Java process that runs Spark.
If you run the script /opt/spark-2.4.4-bin-hadoop2.7/bin/pyspark then you also create a new java process.
You can check if there is indeed such a java process running using something like this: ps aux | grep "java".
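To double-check which pyspark installation a given Python environment actually resolves, and which Spark home the resulting Java process reports, a small sketch like this (names are illustrative, not from the answer) can help:

# Shows which pyspark package is imported and where the driver thinks Spark lives.
import pyspark
from pyspark.sql import SparkSession

print(pyspark.__file__)   # path of the pyspark package this environment picked up
spark = SparkSession.builder.appName("which-spark").getOrCreate()
# Same value the asker saw via sc.getConf.get("spark.home") in spark-shell, if it is set.
print(spark.sparkContext.getConf().get("spark.home", "not set"))
spark.stop()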
I am currently using Python 3.5 and Spark 2.4.
I am trying to run PySpark in a Linux context (git bash) on a Windows machine and I get the following:
set PYSPARK_SUBMIT_ARGS="--name" "PySparkShell" "pyspark-shell" && python3
Does anyone know where I should set these variables? I couldn't find anything that works for me on Google.
spark-shell with Scala works, so I am guessing it is something related to the Python config.
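For what it's worth, variables like PYSPARK_SUBMIT_ARGS can also be set from inside Python before the session is created, which sidesteps shell-specific syntax differences between git bash and cmd; this is only a sketch of that approach, not a confirmed fix for this setup:

import os

# When set by hand, PYSPARK_SUBMIT_ARGS must end with "pyspark-shell".
os.environ["PYSPARK_SUBMIT_ARGS"] = '--name "PySparkShell" pyspark-shell'

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
print(spark.version)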
Dears,
I am using Windows 10 and I am familiar with testing my Python code in Spyder.
However, when I try to run the "import pyspark" command, Spyder shows "No module named 'pyspark'".
Pyspark is installed on my PC and I can also do import pyspark in the command prompt without any error.
I found many blogs explaining how to do this on Ubuntu, but I did not find how to solve it on Windows.
Well, to use packages in Spyder you have to install them through Anaconda. You can open
the "Anaconda prompt" and then run the command below:
conda install pyspark
That will make the package available in Spyder.
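After the install, a quick check inside Spyder's console could look like the following; this is just a minimal smoke test, not part of the original answer:

# Run in the Spyder console to confirm the package is visible to Anaconda's Python.
import pyspark
print(pyspark.__version__)

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("spyder-check").getOrCreate()
print(spark.range(5).count())   # should print 5
spark.stop()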
Hi, I installed Pyspark on Windows 10 a few weeks back. Let me tell you how I did it.
I followed "https://changhsinlee.com/install-pyspark-windows-jupyter/".
So after following each step precisely, you will be able to run pyspark either from the command prompt or by saving a Python file and running it.
To run it via a notebook, download Anaconda, start the Anaconda shell, and type pyspark. Now you don't need to do "import pyspark";
run your program without it and it will be fine. You can also use spark-submit, but for that I found you need to remove the PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS environment variables.
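For reference, inside the shell (or the Jupyter notebook) that the pyspark command starts, a session already exists, so code like the following runs with no import at all; the DataFrame here is purely illustrative:

# `spark` (SparkSession) and `sc` (SparkContext) are pre-created by the pyspark shell,
# so no `import pyspark` is needed in this context.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
df.show()
print(sc.version)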
To run Pyspark, I installed it through pip install pyspark. Now, to initialize the session, after going through many blogs I am running the command below:
import pyspark
spark = pyspark.sql.SparkSession.builder.appName('test').getOrCreate()
The above code gives me the error:
Exception: Java gateway process exited before sending the driver its port number
This will be my first program for Spark. I want your advice on whether "pip install pyspark" is enough to run Spark on my Windows laptop or whether I need to do something else.
I have Java 8 installed on my laptop and I am using conda with Python 3.6.
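This "Java gateway process exited" error usually means the JVM behind pyspark never started, most often because no JDK is visible to the Python process. Below is a hedged sketch of one common check; the JAVA_HOME path is a hypothetical example, not taken from the question:

import os

# Point pyspark at an installed JDK 8 before building the session.
# The path below is an assumed example; use the actual JDK location on your machine.
os.environ.setdefault("JAVA_HOME", r"C:\Program Files\Java\jdk1.8.0_201")

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('test').getOrCreate()
print(spark.version)
spark.stop()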