I am trying to install PySpark and following the instructions and running this from the command line on the cluster node where I have Spark installed:
$ sbt/sbt assembly
This produces the following error:
-bash: sbt/sbt: No such file or directory
I try the next command:
$ ./bin/pyspark
I get this error:
-bash: ./bin/pyspark: No such file or directory
I feel like I'm missing something basic.
What is missing?
I have Spark installed and am able to access it using the command:
$ spark-shell
I have Python on the node and am able to open Python using the command:
$ python
What's your current working directory? The sbt/sbt and ./bin/pyspark commands are relative to the directory containing Spark's code ($SPARK_HOME), so you should be in that directory when running those commands.
Note that Spark offers pre-built binary distributions that are compatible with many common Hadoop distributions; this may be an easier option if you're using one of those distros.
Also, it looks like you linked to the Spark 0.9.0 documentation; if you're building Spark from scratch, I recommend following the latest version of the documentation.
SBT is used to build a Scala project. If you're new to Scala/SBT/Spark, you're doing things the difficult way.
The easiest way to "install" Spark is to simply download Spark (I recommend Spark 1.6.1 -- personal preference). Then unzip the file in the directory you want to have Spark "installed" in, say C:/spark-folder (Windows) or /home/usr/local/spark-folder (Ubuntu).
After you install it in your desired directory, you need to set your environment variables. How you do this depends on your OS; this step is, however, not strictly necessary to run Spark (i.e. pyspark).
If you do not set your environment variables, or don't know how to, an alternative is to simply go to your Spark directory in a terminal window, cd C:/spark-folder (Windows) or cd /home/usr/local/spark-folder (Ubuntu), then type
./bin/pyspark
and Spark should run.
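Once the pyspark shell starts, a quick sanity check (a minimal sketch, assuming the shell has predefined sc as the interactive pyspark shell normally does) is:

print(sc.version)                          # Spark version the shell is running
print(sc.parallelize(range(10)).count())   # tiny job to confirm tasks actually run

If both lines print without errors, the installation is working.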
After installing Apache Spark on 3 nodes on top of Hadoop, I encountered the following problems:
Problem 1- Python version:
I had a problem with setting the Python interpreter on the workers. This is the setting in the .bashrc file, and the same setting is in the spark-env.sh file:
alias python3='/usr/bin/python3'
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3
In the Spark logs (yarn logs --applicationId <app_id>) I could see that everything is as expected:
export USER="hadoop"
export LOGNAME="hadoop"
export PYSPARK_PYTHON="python3"
Although I installed the pandas library (pip install pandas) on the master and worker nodes and made sure it was installed, I constantly received the following message when I used the command /home/hadoop/spark/bin/spark-submit --master yarn --deploy-mode cluster sparksql_recommender_system_2.py:
ModuleNotFoundError: No module named 'pandas'
Surprisingly, this error occurred only in cluster deploy mode; I didn't get it in client deploy mode.
The command which python returns /usr/bin/python, for which the pandas library is installed.
After 2 days I couldn't find an answer on the web. By chance, I tried installing pandas using sudo and it worked :).
sudo pip install pandas
However, what I expected was that Spark would use the Python in /usr/bin/python as the hadoop user, not as the root user. How can I fix it?
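For reference, one way to see which interpreter the executors actually pick up is a small diagnostic job; this is only a sketch (the app name is arbitrary), not part of the original script:

import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("which-python-check").getOrCreate()
print("driver:", sys.executable)   # interpreter running the driver

def interpreter_path(_):
    import sys
    yield sys.executable

# interpreter path as seen inside an executor
print("executor:", spark.sparkContext.parallelize([0], 1).mapPartitions(interpreter_path).collect())
spark.stop()

Submitting this with the same spark-submit command in cluster mode shows whether PYSPARK_PYTHON is actually being honored on the workers.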
Problem 2- different behavior of VS Code SSH:
I use the VS Code SSH extension to connect to a server on which I develop my code. When I do it from one host (PC) I can use spark-submit directly, but on my other PC I have to use the full path /home/hadoop/spark/bin/spark-submit. This is strange because I use VS Code SSH to connect to the same server and the same files. Any idea how I can solve it?
Here's a great discussion of how to package things up so that your Python environment is transferred to the executors.
Create the environment
conda create -y -n pyspark_conda_env -c conda-forge pyarrow pandas conda-pack
conda activate pyspark_conda_env
conda pack -f -o pyspark_conda_env.tar.gz
Ship it:
export PYSPARK_DRIVER_PYTHON=python # Do not set in cluster modes.
export PYSPARK_PYTHON=./environment/bin/python
spark-submit --archives pyspark_conda_env.tar.gz#environment app.py
This does have the disadvantage that the archive has to be shipped every time, but it's the safest and least-hassle way to do it. Installing everything on each node is 'faster' but comes with a higher management burden, and I suggest avoiding it.
All that said... get off pandas. Pandas does Python things (small data). Spark DataFrames do Spark things (big data). I hope it was just an illustrative example and you aren't going to use pandas. (It's not bad! It's just made for small data, so use it for small data.) If you "have to" use it, look into Koalas, which translates pandas-style calls into operations on Spark DataFrames.
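For what it's worth, on Spark 3.2+ the Koalas API ships with Spark itself as pyspark.pandas; a minimal sketch (the file name and column name below are just placeholders):

import pyspark.pandas as ps

psdf = ps.read_csv("data.csv")             # placeholder path
print(psdf.head())                         # pandas-style call, executed by Spark
print(psdf["some_column"].value_counts())  # still distributed under the hood

The point is that the code looks like pandas but the work is done by Spark DataFrames, so it scales past a single machine's memory.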
I am working on an Ubuntu server which has Spark installed on it.
I don't have sudo access to this server.
So under my directory, I created a new virtual environment where I installed pyspark.
When I type the command below:
whereis spark-shell #see below
/opt/spark-2.4.4-bin-hadoop2.7/bin/spark-shell2.cmd /opt/spark-2.4.4-bin-hadoop2.7/bin/spark-shell.cmd /opt/spark-2.4.4-bin-hadoop2.7/bin/spark-shell /home/abcd/.pyenv/shims/spark-shell2.cmd /home/abcd/.pyenv/shims/spark-shell.cmd /home/abcd/.pyenv/shims/spark-shell
Another command:
echo 'sc.getConf.get("spark.home")' | spark-shell
scala> sc.getConf.get("spark.home")
res0: String = /home/abcd/.pyenv/versions/bio/lib/python3.7/site-packages/pyspark
q1) Am I using the right commands to find the installation directory of Spark?
q2) Can you help me understand why I see 3 /opt paths and 3 pyenv paths?
A spark installation (like the one you have in /opt/spark-2.4.4-bin-hadoop2.7) typically comes with a pyspark installation within it. You can check this by downloading and extracting this tarball (https://www.apache.org/dyn/closer.lua/spark/spark-2.4.6/spark-2.4.6-bin-hadoop2.7.tgz).
If you install pyspark in a virtual environment, you're installing another instance of pyspark which comes without the Scala source code but comes with compiled spark code as jars (see the jars folder in your pyspark installation). pyspark is a wrapper over spark (which is written in Scala). This is probably what you're seeing in /home/abcd/.pyenv/shims/.
The scripts spark-shell2.cmd and spark-shell.cmd in the same directory are part of the same spark installation. These are text files and you can cat them. You will see that spark-shell.cmd calls spark-shell2.cmd within it. You will probably have a lot more scripts in your /opt/spark-2.4.4-bin-hadoop2.7/bin/ folder, all of which are a part of the same spark installation. Same goes for the folder /home/abcd/.pyenv/shims/. Finally, /home/abcd/.pyenv/versions/bio/lib/python3.7/site-packages/pyspark seems like yet another pyspark installation.
It shouldn't matter which pyspark installation you use. In order to use Spark, a Java process needs to be created that runs the Scala/Java code (from the jars in your installation).
Typically, when you run a command like this
# Python code
spark = SparkSession.builder.appName('myappname').getOrCreate()
then you create a new Java process that runs Spark.
If you run the script /opt/spark-2.4.4-bin-hadoop2.7/bin/pyspark then you also create a new java process.
You can check if there is indeed such a java process running using something like this: ps aux | grep "java".
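As an extra sanity check from Python itself, you can ask the pyspark package which installation it resolves to (a minimal sketch; run it in whichever interpreter or virtual environment you normally use):

import os
import pyspark

print(pyspark.__version__)                # version of the pyspark package Python imports
print(os.path.dirname(pyspark.__file__))  # directory of that pyspark installation

If this prints a path under your .pyenv virtual environment, you know that is the pyspark being used.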
After installing Spark 2.3 and setting the following environment variables in .bashrc (using Git Bash):
HADOOP_HOME
SPARK_HOME
PYSPARK_PYTHON
JDK_HOME
executing $SPARK_HOME/bin/spark-submit displays the following error:
Error: Could not find or load main class org.apache.spark.launcher.Main
I did some research, checking Stack Overflow and other sites, but could not figure out the problem.
Execution environment
Windows 10 Enterprise
Spark version - 2.3
Python version - 3.6.4
Can you please provide some pointers?
I had that error message. It may have several root causes, but this is how I investigated and solved the problem (on Linux):
instead of launching spark-submit, try using bash -x spark-submit to see which line fails.
do that process several times (since spark-submit calls nested scripts) until you find the underlying command that is called; in my case it was something like:
/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp '/opt/spark-2.2.0-bin-hadoop2.7/conf/:/opt/spark-2.2.0-bin-hadoop2.7/jars/*' -Xmx1g org.apache.spark.deploy.SparkSubmit --class org.apache.spark.repl.Main --name 'Spark shell' spark-shell
So, spark-submit launches a Java process and can't find the org.apache.spark.launcher.Main class using the files in /opt/spark-2.2.0-bin-hadoop2.7/jars/* (see the -cp option above). I did an ls in this jars folder and counted 4 files instead of the whole Spark distribution (~200 files).
It was probably a problem during the installation process. So I reinstalled Spark, checked the jars folder, and it worked like a charm.
So, you should:
check the java command (the -cp option)
check your jars folder (does it at least contain all the spark-*.jar files?)
Hope it helps.
Verify the steps below:
Is spark-launcher_*.jar present in the $SPARK_HOME/jars folder?
Extract spark-launcher_*.jar to verify whether it contains Main.class.
If the above is true, then you may be running spark-submit on Windows using a Cygwin terminal.
Try using spark-submit.cmd instead. Also, Cygwin parses drives like /c/, which will not work on Windows, so it is important to provide absolute paths for the environment variables by qualifying them with 'C:/' and not '/c/'.
Check that the Spark home directory contains all its folders and files (xml, jars, etc.); otherwise reinstall Spark.
Check that your JAVA_HOME and SPARK_HOME environment variables are set in your .bashrc file; try setting them as below:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
export SPARK_HOME=/home/ubuntu-username/spark-2.4.8-bin-hadoop2.6/
Or wherever your spark is downloaded to
export SPARK_HOME=/home/Downloads/spark-2.4.8-bin-hadoop2.6/
Once done, save your .bashrc and run bash in the terminal (or restart the shell), then try spark-shell.
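If you want to double-check that those variables are actually visible to the processes you launch, a small Python check (variable names taken from the answers above) is:

import os

for name in ("JAVA_HOME", "SPARK_HOME", "HADOOP_HOME", "PYSPARK_PYTHON"):
    print(name, "=", os.environ.get(name, "<not set>"))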
I currently have an executable file that is running Python code inside a zipfile following this: https://blogs.gnome.org/jamesh/2012/05/21/python-zip-files/
The nice thing about this is that I release a single file containing the app. The problems arise with the dependencies. I have attempted to install packages using pip in custom locations, but when I embed them in the zip I always have import issues, or issues that end up depending on host packages.
I then started looking into virtual environments as a way to ensure package dependencies. However, it seems that the typical workflow on the target machine is to source the activation script and run the code within the virtualenv. What I would like to do is have a single file containing a Python script and all its dependencies and for the user to just execute the file. Is this possible given that the Python interpreter is actually packaged with the virtualenv? Is it possible to invoke the Python interpreter from within the zip file? What is the recommended approach for this from a Python point of view?
You can create a bash script that creates the virtual environment and runs the Python script as well.
#!/bin/bash
# Create the virtual environment, install the dependencies into it,
# then run the script with the environment's own interpreter.
virtualenv .venv
.venv/bin/pip install <python packages>
.venv/bin/python script
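If you'd rather stay in Python, the standard-library zipapp module can build a single executable archive; this is only a sketch, under the assumption that your code and its pip-installed dependencies all live in one source directory (myapp/ is a made-up name), and note that it still relies on the host interpreter rather than bundling one:

import zipapp

# myapp/ is a hypothetical directory containing __main__.py plus dependencies
# installed with `pip install --target myapp <packages>`.
zipapp.create_archive(
    "myapp",
    target="myapp.pyz",
    interpreter="/usr/bin/env python3",  # host interpreter; zipapp does not bundle Python
)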
I am trying to work with 12 GB of data in Python, for which I desperately need to use Spark, but I guess I'm too stupid to use the command line by myself or with help from the internet, and that is why I have to turn to SO.
So far I have downloaded Spark and unzipped the tar file, or whatever that is (sorry for the language, but I am feeling stupid and lost), but now I can see nowhere to go. I have seen the instructions in the Spark website documentation and it says:
Spark also provides a Python API. To run Spark interactively in a Python interpreter, use bin/pyspark. But where do I do this? Please, please help.
Edit: I am using Windows 10.
Note: I have always faced problems when trying to install something, mainly because I can't seem to understand the command prompt.
If you are more familiar with Jupyter Notebook, you can install Apache Toree, which integrates PySpark, Scala, SQL and SparkR kernels with Spark.
To install Toree:
pip install toree
jupyter toree install --spark_home=path/to/your/spark_directory --interpreters=PySpark
If you want to install other kernels, you can use
jupyter toree install --interpreters=SparkR,SQL,Scala
Now run
jupyter notebook
In the UI, when selecting a new notebook, you should see the following kernels available:
Apache Toree-Pyspark
Apache Toree-SparkR
Apache Toree-SQL
Apache Toree-Scala
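Once the Apache Toree-PySpark kernel is running, a quick check in a notebook cell (a minimal sketch; getOrCreate reuses the kernel's existing session if there is one) could be:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("toree-check").getOrCreate()
print(spark.version)             # confirms the notebook is talking to Spark
print(spark.range(100).count())  # tiny distributed job; should print 100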
When you unzip the file, a directory is created.
Open a terminal.
Navigate to that directory with cd.
Do an ls. You will see its contents; a bin directory must be in there somewhere.
Execute bin/pyspark or maybe ./bin/pyspark.
Of course, in practice it's not that simple; you may need to set some paths, as described in TutorialsPoint, but there are plenty of such links out there.
I understand that you have already installed Spark on Windows 10.
You will need to have winutils.exe available as well. If you haven't already done so, download the file from http://public-repo-1.hortonworks.com/hdp-win-alpha/winutils.exe and place it at, say, C:\winutils\bin.
Set up environment variables
HADOOP_HOME=C:\winutils
SPARK_HOME=C:\spark or wherever.
PYSPARK_DRIVER_PYTHON=ipython or jupyter notebook
PYSPARK_DRIVER_PYTHON_OPTS=notebook
Now navigate to the C:\Spark directory in a command prompt and type "pyspark"
Jupyter notebook will launch in a browser.
Create a Spark context and run a count command, as in the sketch below.
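For example, a minimal sketch (the numbers are arbitrary sample data):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()                 # reuses the context the pyspark launcher provides, if any
print(sc.parallelize([1, 2, 3, 4, 5]).count())  # should print 5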