How to use remote Spark in local VS Code? - python

I'm starting to learn Spark but am stuck at the first step.
I downloaded Spark from the Apache website and finished the configuration. If I run the pyspark command in WSL, a Jupyter server starts, I can open it in my Windows browser, and import pyspark works fine. But if I connect to WSL with VS Code and create a new notebook there, the pyspark module can't be found.
I didn't install the pyspark module through pip or conda because I assumed it was already included in the full distribution I downloaded, so installing it again seemed redundant.
Is there any way to use the Spark installed in WSL from VS Code without installing pyspark separately?

Related

Why do I see multiple Spark installation directories?

I am working on an Ubuntu server that has Spark installed.
I don't have sudo access to this server,
so under my home directory I created a new virtual environment and installed pyspark in it.
When I run the command below
whereis spark-shell #see below
/opt/spark-2.4.4-bin-hadoop2.7/bin/spark-shell2.cmd /opt/spark-2.4.4-bin-hadoop2.7/bin/spark-shell.cmd /opt/spark-2.4.4-bin-hadoop2.7/bin/spark-shell /home/abcd/.pyenv/shims/spark-shell2.cmd /home/abcd/.pyenv/shims/spark-shell.cmd /home/abcd/.pyenv/shims/spark-shell
And another command:
echo 'sc.getConf.get("spark.home")' | spark-shell
scala> sc.getConf.get("spark.home")
res0: String = /home/abcd/.pyenv/versions/bio/lib/python3.7/site-packages/pyspark
q1) Am I using the right commands to find the Spark installation directory?
q2) Can you help me understand why I see 3 /opt paths and 3 pyenv paths?
A Spark installation (like the one you have in /opt/spark-2.4.4-bin-hadoop2.7) typically comes with a pyspark installation inside it. You can check this by downloading and extracting this tarball (https://www.apache.org/dyn/closer.lua/spark/spark-2.4.6/spark-2.4.6-bin-hadoop2.7.tgz).
If you install pyspark in a virtual environment, you are installing another instance of pyspark, which comes without the Scala source code but with the compiled Spark code as jars (see the jars folder in your pyspark installation). pyspark is a wrapper over Spark (which is written in Scala). This is probably what you're seeing in /home/abcd/.pyenv/shims/.
The scripts spark-shell2.cmd and spark-shell.cmd in the same directory are part of the same Spark installation. They are text files, so you can cat them; you will see that spark-shell.cmd calls spark-shell2.cmd. You will probably have many more scripts in your /opt/spark-2.4.4-bin-hadoop2.7/bin/ folder, all of which are part of the same Spark installation. The same goes for the folder /home/abcd/.pyenv/shims/. Finally, /home/abcd/.pyenv/versions/bio/lib/python3.7/site-packages/pyspark seems like yet another pyspark installation.
It shouldn't matter which pyspark installation you use. To use Spark, a Java process has to be started that runs the Scala/Java code (from the jars in your installation).
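As a rough sketch of how the pyspark copy bundled inside a Spark download could be made importable from a plain Python interpreter: the download usually keeps its Python sources under python/ and a py4j zip under python/lib/. The helper names below are made up for illustration, and the directory layout is an assumption you should check against your own installation.

```python
import glob
import os
import sys


def bundled_pyspark_paths(spark_home):
    """Return the directory and zip files that would need to be on sys.path.

    Assumes the usual layout of a Spark download:
    <spark_home>/python and <spark_home>/python/lib/py4j-*-src.zip.
    """
    python_dir = os.path.join(spark_home, "python")
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip")))
    return [python_dir] + py4j_zips


def use_bundled_pyspark(spark_home):
    """Prepend the bundled pyspark paths to sys.path if they are not there yet."""
    for path in reversed(bundled_pyspark_paths(spark_home)):
        if path not in sys.path:
            sys.path.insert(0, path)
```

Calling use_bundled_pyspark("/opt/spark-2.4.4-bin-hadoop2.7") before import pyspark would then pick up the bundled copy instead of a pip-installed one. The findspark package on PyPI takes a similar approach, if you would rather not maintain this yourself.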
Typically, when you run code like this
# Python code
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('myappname').getOrCreate()
you create a new Java process that runs Spark.
If you run the script /opt/spark-2.4.4-bin-hadoop2.7/bin/pyspark, you also create a new Java process.
You can check whether such a Java process is actually running with something like: ps aux | grep "java".
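The same check can be done from Python. This is only a sketch: the function names are illustrative, and it assumes a system where the ps command is available (Linux, macOS or WSL).

```python
import subprocess


def java_lines(ps_output):
    """Keep only the lines of `ps` output that mention java."""
    return [line for line in ps_output.splitlines() if "java" in line.lower()]


def running_java_processes():
    """Run `ps aux` and return the lines that mention java."""
    result = subprocess.run(
        ["ps", "aux"], capture_output=True, text=True, check=True
    )
    return java_lines(result.stdout)
```

Calling running_java_processes() before and after creating the SparkSession should show whether a new Java process appeared.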

Running pyspark in Spyder (Anaconda) on Windows

Hi all,
I am using Windows 10 and I usually test my Python code in Spyder.
However, when I try to run the "import pyspark" command, Spyder shows "No module named 'pyspark'".
PySpark is installed on my PC, and I can run import pyspark in the command prompt without any error.
I found many blogs explaining how to do this on Ubuntu, but none explaining how to solve it on Windows.
To use a package in Spyder, it has to be installed in the Anaconda environment that Spyder runs in. Open "Anaconda Prompt" and type the following command:
conda install pyspark
That will make the package available in Spyder.
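One way to see why Spyder and the command prompt disagree is to run the same small check in both and compare the results. This is a sketch using only the standard library; describe_environment is an illustrative name, not part of Spyder or pyspark.

```python
import importlib.util
import sys


def describe_environment(module_name="pyspark"):
    """Report which interpreter is running and whether it can see the module."""
    spec = importlib.util.find_spec(module_name)
    return {
        "interpreter": sys.executable,
        "found": spec is not None,
        "location": spec.origin if spec is not None else None,
    }
```

If the "interpreter" value differs between the two, pyspark was installed for a different Python than the one Spyder uses.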
Hi, I installed PySpark on Windows 10 a few weeks ago. Let me tell you how I did it.
I followed "https://changhsinlee.com/install-pyspark-windows-jupyter/".
After following each step precisely, you should be able to run pyspark either from the command prompt or by saving a Python file and running it.
To run it in a notebook (download Anaconda first), start the Anaconda shell and type pyspark. You then don't need to run "import pyspark".
Run your program without it and it will be fine. You can also use spark-submit, but for that I found that you need to remove the PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS environment variables.

Executing PySpark on Windows gives an error

To run PySpark, I installed it with pip install pyspark. After going through many blogs, I am running the code below to initialize the session:
import pyspark
spark = pyspark.sql.SparkSession.builder.appName('test').getOrCreate()
The code above gives me this error:
Exception: Java gateway process exited before sending the driver its port number
This will be my first Spark program. I would like your advice on whether "pip install pyspark" is enough to run Spark on my Windows laptop or whether I need to do something else.
I have Java 8 installed on my laptop and I am using conda with Python 3.6.
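This error has several possible causes. A common one, which is an assumption here and not something confirmed in this thread, is that the Python process cannot find the Java executable. A minimal sketch for checking that from the same interpreter that runs the failing code (java_status is an illustrative name):

```python
import os
import shutil


def java_status(env=None):
    """Report JAVA_HOME and where (if anywhere) `java` is found on PATH."""
    env = os.environ if env is None else env
    return {
        "JAVA_HOME": env.get("JAVA_HOME"),
        "java_on_path": shutil.which("java", path=env.get("PATH")),
    }
```

If both values come back as None, setting JAVA_HOME to the Java 8 installation directory and restarting the interpreter is a reasonable first thing to try.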

How to use Spark with Python or Jupyter Notebook

I am trying to work with 12 GB of data in Python, for which I badly need Spark, but I have not managed to get it working from the command line, either on my own or with help from the internet, so I am turning to SO.
So far I have downloaded Spark and unpacked the tar file, but now I can't see where to go next. The instructions in the Spark documentation say:
Spark also provides a Python API. To run Spark interactively in a Python interpreter, use bin/pyspark
But where do I run this? Please help.
Edit: I am using Windows 10.
Note: I have always had problems installing things, mainly because I don't really understand the command prompt.
If you are more familiar with Jupyter Notebook, you can install Apache Toree, which integrates PySpark, Scala, SQL and SparkR kernels with Spark.
To install Toree:
pip install toree
jupyter toree install --spark_home=path/to/your/spark_directory --interpreters=PySpark
If you want to install other kernels, you can use
jupyter toree install --interpreters=SparkR,SQL,Scala
Now run
jupyter notebook
In the UI, when creating a new notebook, you should see the following kernels available:
Apache Toree-Pyspark
Apache Toree-SparkR
Apache Toree-SQL
Apache Toree-Scala
When you unpack the file, a directory is created.
Open a terminal.
Navigate to that directory with cd.
Run ls to see its contents. A bin directory should be listed among them.
Execute bin/pyspark or maybe ./bin/pyspark.
Of course, in practice it's not that simple; you may need to set some paths, as described on TutorialsPoint, but there are plenty of such guides out there.
I understand that you have already installed Spark on Windows 10.
You will need winutils.exe as well. If you haven't already done so, download the file from http://public-repo-1.hortonworks.com/hdp-win-alpha/winutils.exe and place it in, say, C:\winutils\bin.
Set up the environment variables:
HADOOP_HOME=C:\winutils
SPARK_HOME=C:\spark (or wherever Spark is installed)
PYSPARK_DRIVER_PYTHON=jupyter (or ipython)
PYSPARK_DRIVER_PYTHON_OPTS=notebook
Now navigate to the C:\spark directory in a command prompt and type "pyspark".
Jupyter Notebook will launch in a browser.
Create a Spark context and run a count command to check that it works.
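A minimal sketch of what that final check could look like in a notebook cell. The paths are the example locations from this answer, and windows_spark_env and count_example are illustrative names. It assumes pyspark is importable in the notebook; the count itself only runs when count_example() is called.

```python
import os


def windows_spark_env(spark_home=r"C:\spark", hadoop_home=r"C:\winutils"):
    """Environment variables from the steps above (example paths)."""
    return {"SPARK_HOME": spark_home, "HADOOP_HOME": hadoop_home}


def count_example():
    """Start a session and count a small generated dataset (needs pyspark)."""
    # Only fill in variables that are not already set system-wide.
    for name, value in windows_spark_env().items():
        os.environ.setdefault(name, value)

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("count-example").getOrCreate()
    try:
        # spark.range(100) is a one-column DataFrame with the values 0..99.
        return spark.range(100).count()
    finally:
        spark.stop()
```

If count_example() returns a number instead of raising an error, the installation is working.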

pyspark interpreter not found in Apache Zeppelin

I am having an issue using pyspark in an Apache Zeppelin (version 0.6.0) notebook. Running the following simple code gives me a "pyspark interpreter not found" error:
%pyspark
a = 1+3
Running sc.version gave me res2: String = 1.6.0, which is the version of Spark installed on my machine. Running z returns res0: org.apache.zeppelin.spark.ZeppelinContext = {}
PySpark works from the CLI (using Spark 1.6.0 and Python 2.6.6).
The default Python on the machine is 2.6.6; Anaconda Python 3.5 is also installed but not set as the default.
Based on this post, I updated the zeppelin-env.sh file located at /usr/hdp/current/zeppelin-server/lib/conf and added the Anaconda Python 3 path:
export PYSPARK_PYTHON=/opt/anaconda3/bin/python
export PYTHONPATH=/opt/anaconda3/bin/python
After that I stopped and restarted Zeppelin many times using
/usr/hdp/current/zeppelin-server/lib/bin/zeppelin-daemon.sh
But I can't get the pyspark interpreter to work in Zeppelin.
If you find that pyspark is not responding, try restarting the Spark interpreter in Zeppelin; it may resolve the error.
