After installing Apache Spark on 3 nodes on top of Hadoop, I encountered the following problems:
Problem 1: Python version
I had a problem setting the Python interpreter on the workers. This is the setting in the .bashrc file, and the same setting is in the spark-env.sh file:
alias python3='/usr/bin/python3'
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3
In the Spark logs (yarn logs --applicationId <app_id>) I could see that everything is as expected:
export USER="hadoop"
export LOGNAME="hadoop"
export PYSPARK_PYTHON="python3"
Although I installed the pandas library (pip install pandas) on the master and worker nodes and made sure it was installed, I kept receiving the following message when I ran the command /home/hadoop/spark/bin/spark-submit --master yarn --deploy-mode cluster sparksql_recommender_system_2.py:
ModuleNotFoundError: No module named 'pandas'
Surprisingly, this error occurred only in cluster mode; I didn't get it in client deploy mode.
The command which python returns /usr/bin/python, and the pandas library is installed for that interpreter.
After 2 days I couldn't find an answer on the web. By chance, I tried installing pandas using sudo and it worked :).
sudo pip install pandas
However, I expected Spark to use the Python in /usr/bin/python as the hadoop user, not as the root user. How can I fix it?
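(For anyone debugging the same thing: one way to see which interpreter, and therefore whose site-packages, the executors actually use is a small diagnostic job like the sketch below. The app name and partition count are arbitrary, and it assumes the job is submitted the same way as above.)

# Minimal diagnostic sketch: print the driver's interpreter and the
# interpreters the YARN executors actually run.
import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("python-check").getOrCreate()

def executor_python(_):
    import sys  # runs on the executors, not the driver
    return [sys.executable]

print("driver python:   ", sys.executable)
print("executor pythons:", set(
    spark.sparkContext.parallelize(range(4), 4)
         .mapPartitions(executor_python)
         .collect()))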
Problem 2: different behavior of VS Code SSH
I use the VS Code SSH extension to connect to a server on which I develop my code. When I do it from one host (PC) I can use spark-submit directly, but on my other PC I have to use the full path /home/hadoop/spark/bin/spark-submit. It is strange because I use VS Code SSH to connect to the same server and files. Any idea how I can solve it?
Here's a great discussion on how to package things up so that your Python environment is transferred to the executors.
Create the environment
conda create -y -n pyspark_conda_env -c conda-forge pyarrow pandas conda-pack
conda activate pyspark_conda_env
conda pack -f -o pyspark_conda_env.tar.gz
Ship it:
export PYSPARK_DRIVER_PYTHON=python # Do not set in cluster modes.
export PYSPARK_PYTHON=./environment/bin/python
spark-submit --archives pyspark_conda_env.tar.gz#environment app.py
This does have the disadvantage that the environment has to be shipped every time, but it's the safest and least-hassle way to do it. Installing everything on each node is 'faster' but comes with a higher management burden, and I suggest avoiding it.
All that said... get off pandas. Pandas does Python things (small data); Spark DataFrames do Spark things (big data). I hope it was just an illustrative example and you aren't actually going to use pandas. (It's not bad! It's just made for small data, so use it for small data.) If you "have to" use it, look into Koalas, which provides a translation layer that lets you ask pandas things of Spark DataFrames.
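If you do go the Koalas / pandas-on-Spark route, a minimal sketch looks something like this (with Spark 3.2+ it ships as pyspark.pandas; on older versions you would pip install koalas and import databricks.koalas instead):

# Minimal sketch of the pandas-style API on top of Spark (assumes Spark 3.2+).
import pyspark.pandas as ps

psdf = ps.DataFrame({"user": [1, 1, 2], "rating": [4.0, 3.5, 5.0]})
print(psdf.groupby("user")["rating"].mean())  # pandas-like call, Spark execution

sdf = psdf.to_spark()  # drop down to a plain Spark DataFrame when needed
sdf.show()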
Related
I am working on an Ubuntu server which has Spark installed on it.
I don't have sudo access to this server.
So under my directory, I created a new virtual environment in which I installed pyspark.
When I type the command below:
whereis spark-shell #see below
/opt/spark-2.4.4-bin-hadoop2.7/bin/spark-shell2.cmd /opt/spark-2.4.4-bin-hadoop2.7/bin/spark-shell.cmd /opt/spark-2.4.4-bin-hadoop2.7/bin/spark-shell /home/abcd/.pyenv/shims/spark-shell2.cmd /home/abcd/.pyenv/shims/spark-shell.cmd /home/abcd/.pyenv/shims/spark-shell
another command
echo 'sc.getConf.get("spark.home")' | spark-shell
scala> sc.getConf.get("spark.home")
res0: String = /home/abcd/.pyenv/versions/bio/lib/python3.7/site-packages/pyspark
Q1) Am I using the right commands to find the installation directory of Spark?
Q2) Can you help me understand why I see 3 /opt paths and 3 pyenv paths?
A Spark installation (like the one you have in /opt/spark-2.4.4-bin-hadoop2.7) typically comes with a pyspark installation inside it. You can check this by downloading and extracting this tarball (https://www.apache.org/dyn/closer.lua/spark/spark-2.4.6/spark-2.4.6-bin-hadoop2.7.tgz).
If you install pyspark in a virtual environment, you're installing another instance of pyspark, which comes without the Scala source code but does come with the compiled Spark code as jars (see the jars folder in your pyspark installation). pyspark is a wrapper over Spark (which is written in Scala). This is probably what you're seeing in /home/abcd/.pyenv/shims/.
The scripts spark-shell2.cmd and spark-shell.cmd in the same directory are part of the same spark installation. These are text files and you can cat them. You will see that spark-shell.cmd calls spark-shell2.cmd within it. You will probably have a lot more scripts in your /opt/spark-2.4.4-bin-hadoop2.7/bin/ folder, all of which are a part of the same spark installation. Same goes for the folder /home/abcd/.pyenv/shims/. Finally, /home/abcd/.pyenv/versions/bio/lib/python3.7/site-packages/pyspark seems like yet another pyspark installation.
It shouldn't matter which pyspark installation you use. In order to use Spark, a Java process needs to be created that runs the Scala/Java code (from the jars in your installation).
Typically, when you run a command like this
# Python code
spark = SparkSession.builder.appName('myappname').getOrCreate()
then you create a new Java process that runs Spark.
If you run the script /opt/spark-2.4.4-bin-hadoop2.7/bin/pyspark then you also create a new java process.
You can check if there is indeed such a java process running using something like this: ps aux | grep "java".
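As a quick way to check which pyspark (and which underlying Spark) a given Python environment resolves to, something like this sketch works; note that the "spark.home" key may be unset in some setups, hence the default value:

# Sketch: show which pyspark package is imported and where it thinks Spark lives.
import pyspark
from pyspark.sql import SparkSession

print("pyspark package:", pyspark.__file__)
print("pyspark version:", pyspark.__version__)

spark = SparkSession.builder.appName("which-spark").getOrCreate()
print("spark.home:", spark.sparkContext.getConf().get("spark.home", "not set"))
spark.stop()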
After installing Spark 2.3 and setting the following env variables in .bashrc (using Git Bash),
HADOOP_HOME
SPARK_HOME
PYSPARK_PYTHON
JDK_HOME
executing $SPARK_HOME/bin/spark-submit displays the following error:
Error: Could not find or load main class org.apache.spark.launcher.Main
I did some research on Stack Overflow and other sites, but could not figure out the problem.
Execution environment
Windows 10 Enterprise
Spark version - 2.3
Python version - 3.6.4
Can you please provide some pointers?
I had that error message. It may have several root causes, but this is how I investigated and solved the problem (on Linux):
Instead of launching spark-submit, try using bash -x spark-submit to see which line fails.
Repeat that process several times (since spark-submit calls nested scripts) until you find the underlying process being called; in my case it was something like:
/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp '/opt/spark-2.2.0-bin-hadoop2.7/conf/:/opt/spark-2.2.0-bin-hadoop2.7/jars/*' -Xmx1g org.apache.spark.deploy.SparkSubmit --class org.apache.spark.repl.Main --name 'Spark shell' spark-shell
So, spark-submit launches a Java process and can't find the org.apache.spark.launcher.Main class using the files in /opt/spark-2.2.0-bin-hadoop2.7/jars/* (see the -cp option above). I did an ls in this jars folder and counted 4 files instead of the whole Spark distribution (~200 files).
It was probably a problem during the installation process, so I reinstalled Spark, checked the jars folder, and it worked like a charm.
So, you should:
check the java command (the -cp option)
check your jars folder (does it contain at least all the spark-*.jar files? see the sketch below)
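To make that jar check concrete, here is a rough sketch (the fallback SPARK_HOME path is just an example; adjust it to your installation):

# Count the jars under $SPARK_HOME/jars and flag whether the launcher jar
# (which contains org.apache.spark.launcher.Main) is present.
import glob
import os

spark_home = os.environ.get("SPARK_HOME", "/opt/spark-2.2.0-bin-hadoop2.7")
jars = glob.glob(os.path.join(spark_home, "jars", "*.jar"))
print(len(jars), "jars found in", os.path.join(spark_home, "jars"))
print("launcher jar present:",
      any("spark-launcher" in os.path.basename(j) for j in jars))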
Hope it helps.
Verify the steps below:
Is spark-launcher_*.jar present in the $SPARK_HOME/jars folder?
Extract spark-launcher_*.jar to verify whether it contains Main.class.
If the above is true, then you may be running spark-submit on Windows using a Cygwin terminal.
Try using spark-submit.cmd instead. Also, Cygwin parses drives like /c/, which will not work on Windows, so it is important to provide absolute paths for the env variables, qualified with 'C:/' and not '/c/'.
Check that your Spark home directory contains all folders and files (xml, jars, etc.); otherwise, reinstall Spark.
Check that your JAVA_HOME and SPARK_HOME environment variables are set in your .bashrc file; try setting the below:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
export SPARK_HOME=/home/ubuntu-username/spark-2.4.8-bin-hadoop2.6/
or wherever your Spark is downloaded to:
export SPARK_HOME=/home/Downloads/spark-2.4.8-bin-hadoop2.6/
Once done, save your .bashrc, run the bash command in a terminal (or restart the shell), and try spark-shell.
I've spent a few days now trying to make Spark work with my Jupyter Notebook and Anaconda. Here's what my .bash_profile looks like:
PATH="/my/path/to/anaconda3/bin:$PATH"
export JAVA_HOME="/my/path/to/jdk"
export PYTHON_PATH="/my/path/to/anaconda3/bin/python"
export PYSPARK_PYTHON="/my/path/to/anaconda3/bin/python"
export PATH=$PATH:/my/path/to/spark-2.1.0-bin-hadoop2.7/bin
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark
export SPARK_HOME=/my/path/to/spark-2.1.0-bin-hadoop2.7
alias pyspark="pyspark --conf spark.local.dir=/home/puifais --num-executors 30 --driver-memory 128g --executor-memory 6g --packages com.databricks:spark-csv_2.11:1.5.0"
When I type /my/path/to/spark-2.1.0-bin-hadoop2.7/bin/spark-shell, I can launch Spark just fine in my command line shell. And the output sc is not empty. It seems to work fine.
When I type pyspark, it launches my Jupyter Notebook fine. When I create a new Python3 notebook, this error appears:
[IPKernelApp] WARNING | Unknown error in handling PYTHONSTARTUP file /my/path/to/spark-2.1.0-bin-hadoop2.7/python/pyspark/shell.py:
And sc in my Jupyter Notebook is empty.
Can anyone help solve this situation?
Just to clarify: there is nothing after the colon at the end of the error. I also tried to create my own startup file using this post, and I quote it here so you don't have to go look there:
I created a short initialization script init_spark.py as follows:
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("yarn-client")
sc = SparkContext(conf = conf)
and placed it in the ~/.ipython/profile_default/startup/ directory
When I did this, the error then became:
[IPKernelApp] WARNING | Unknown error in handling PYTHONSTARTUP file /my/path/to/spark-2.1.0-bin-hadoop2.7/python/pyspark/shell.py:
[IPKernelApp] WARNING | Unknown error in handling startup files:
Well, it really gives me pain to see how crappy hacks, like setting PYSPARK_DRIVER_PYTHON=jupyter, have been promoted to "solutions" and tend now to become standard practices, despite the fact that they evidently lead to ugly outcomes, like typing pyspark and ending up with a Jupyter notebook instead of a PySpark shell, plus yet-unseen problems lurking downstream, such as when you try to use spark-submit with the above settings... :(
(Don't get me wrong, it is not your fault and I am not blaming you; I have seen dozens of posts here at SO where this "solution" has been proposed, accepted, and upvoted...).
At the time of writing (Dec 2017), there is one and only one proper way to customize a Jupyter notebook in order to work with other languages (PySpark here), and this is the use of Jupyter kernels.
The first thing to do is run a jupyter kernelspec list command, to get the list of any already available kernels in your machine; here is the result in my case (Ubuntu):
$ jupyter kernelspec list
Available kernels:
python2 /usr/lib/python2.7/site-packages/ipykernel/resources
caffe /usr/local/share/jupyter/kernels/caffe
ir /usr/local/share/jupyter/kernels/ir
pyspark /usr/local/share/jupyter/kernels/pyspark
pyspark2 /usr/local/share/jupyter/kernels/pyspark2
tensorflow /usr/local/share/jupyter/kernels/tensorflow
The first kernel, python2, is the "default" one coming with IPython (there is a great chance of this being the only one present in your system); as for the rest, I have 2 more Python kernels (caffe & tensorflow), an R one (ir), and two PySpark kernels for use with Spark 1.6 and Spark 2.0 respectively.
The entries of the list above are directories, and each one contains one single file, named kernel.json. Let's see the contents of this file for my pyspark2 kernel:
{
  "display_name": "PySpark (Spark 2.0)",
  "language": "python",
  "argv": [
    "/opt/intel/intelpython27/bin/python2",
    "-m",
    "ipykernel",
    "-f",
    "{connection_file}"
  ],
  "env": {
    "SPARK_HOME": "/home/ctsats/spark-2.0.0-bin-hadoop2.6",
    "PYTHONPATH": "/home/ctsats/spark-2.0.0-bin-hadoop2.6/python:/home/ctsats/spark-2.0.0-bin-hadoop2.6/python/lib/py4j-0.10.1-src.zip",
    "PYTHONSTARTUP": "/home/ctsats/spark-2.0.0-bin-hadoop2.6/python/pyspark/shell.py",
    "PYSPARK_PYTHON": "/opt/intel/intelpython27/bin/python2"
  }
}
I have not bothered to change my details to /my/path/to etc., and you can already see that there are some differences between our cases (I use Intel Python 2.7, and not Anaconda Python 3), but hopefully you get the idea (BTW, don't worry about the connection_file - I don't use one either).
Now, the easiest way for you would be to manually make the necessary changes (paths only) to the kernel shown above and save it in a new subfolder of the .../jupyter/kernels directory (that way, it should be visible if you run the jupyter kernelspec list command again). And if you think this approach is also a hack, well, I would agree with you, but it is the one recommended in the Jupyter documentation (page 12):
However, there isn’t a great way to modify the kernelspecs. One approach uses jupyter kernelspec list to find the kernel.json file and then modifies it, e.g. kernels/python3/kernel.json, by hand.
If you don't already have a .../jupyter/kernels folder, you can still install a new kernel using jupyter kernelspec install - I haven't tried it, but have a look at this SO answer.
Finally, don't forget to remove all the PySpark-related environment variables from your bash profile (leaving only SPARK_HOME should be OK). And confirm that, when you type pyspark, you find yourself with a PySpark shell, as it should be, and not with a Jupyter notebook...
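Once the kernel is in place and selected in a notebook, shell.py (via PYTHONSTARTUP) should already have created sc for you; a quick sanity check might look like this minimal sketch (no imports needed, because the kernel provides sc):

# Run in a notebook cell using the PySpark kernel; `sc` comes from shell.py.
print(sc.version)                        # Spark version the kernel picked up
print(sc.parallelize(range(100)).sum())  # should print 4950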
UPDATE (after comment): If you want to pass command-line arguments to PySpark, you should add the PYSPARK_SUBMIT_ARGS setting under env; for example, here is the last line of my respective kernel file for Spark 1.6.0, where we still had to use the external spark-csv package for reading CSV files:
"PYSPARK_SUBMIT_ARGS": "--master local --packages com.databricks:spark-csv_2.10:1.4.0 pyspark-shell"
Conda can help correctly manage a lot of dependencies...
Install Spark. Assuming Spark is installed in /opt/spark, include this in your ~/.bashrc:
export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH
Create a conda environment with all needed dependencies apart from spark:
conda create -n findspark-jupyter-openjdk8-py3 -c conda-forge python=3.5 jupyter=1.0 notebook=5.0 openjdk=8.0.144 findspark=1.1.0
Activate the environment
$ source activate findspark-jupyter-openjdk8-py3
Launch a Jupyter Notebook server:
$ jupyter notebook
In your browser, create a new Python 3 notebook.
Try calculating pi with the following script (borrowed from this):
import findspark
findspark.init()
import pyspark
import random
sc = pyspark.SparkContext(appName="Pi")
num_samples = 100000000
def inside(p):
    x, y = random.random(), random.random()
    return x*x + y*y < 1
count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(pi)
sc.stop()
I just conda installed sparkmagic (after re-installing a newer version of Spark).
I think that alone simply works, and it is much simpler than fiddling with configuration files by hand.
I am trying to work with 12 GB of data in Python, for which I desperately need to use Spark, but I guess I'm too stupid to use the command line by myself or with help from the internet, and that is why I guess I have to turn to SO.
So far I have downloaded Spark and unzipped the tar file, or whatever that is (sorry for the language, but I am feeling stupid and lost), but now I can see nowhere to go. I have seen the instructions in the Spark website documentation, and it says:
"Spark also provides a Python API. To run Spark interactively in a Python interpreter, use bin/pyspark" - but where do I do this? Please, please help.
Edit: I am using Windows 10.
Note: I have always faced problems when trying to install something, mainly because I can't seem to understand the command prompt.
If you are more familiar with Jupyter Notebook, you can install Apache Toree, which integrates PySpark, Scala, SQL, and SparkR kernels with Spark.
To install Toree:
pip install toree
jupyter toree install --spark_home=path/to/your/spark_directory --interpreters=PySpark
If you want to install other kernels, you can use:
jupyter toree install --interpreters=SparkR,SQL,Scala
Now run
jupyter notebook
In the UI, when selecting a new notebook, you should see the following kernels available:
Apache Toree-Pyspark
Apache Toree-SparkR
Apache Toree-SQL
Apache Toree-Scala
When you unzip the file, a directory is created.
Open a terminal.
Navigate to that directory with cd.
Do an ls. You will see its contents; the bin directory must be in there somewhere.
Execute bin/pyspark or maybe ./bin/pyspark.
Of course, in practice it's not that simple; you may need to set some paths, as described in TutorialsPoint, but there are plenty of such links out there.
I understand that you have already installed Spark on Windows 10.
You will need to have winutils.exe available as well. If you haven't already done so, download the file from http://public-repo-1.hortonworks.com/hdp-win-alpha/winutils.exe and install it at, say, C:\winutils\bin
Set up environment variables
HADOOP_HOME=C:\winutils
SPARK_HOME=C:\spark or wherever.
PYSPARK_DRIVER_PYTHON=ipython or jupyter notebook
PYSPARK_DRIVER_PYTHON_OPTS=notebook
Now navigate to the C:\Spark directory in a command prompt and type "pyspark".
A Jupyter notebook will launch in a browser.
Create a Spark context and run a count command, as shown in the sketch below.
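The original screenshot is not reproduced here; a minimal sketch of what that notebook cell might look like:

# Minimal sketch of the notebook cell: get a SparkContext and run a count.
# With PYSPARK_DRIVER_PYTHON=jupyter the shell may already have created `sc`,
# so getOrCreate() simply reuses it.
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
print(sc.parallelize(range(1000)).count())  # should print 1000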
I am trying to install PySpark, following the instructions, and running this from the command line on the cluster node where I have Spark installed:
$ sbt/sbt assembly
This produces the following error:
-bash: sbt/sbt: No such file or directory
I try the next command:
$ ./bin/pyspark
I get this error:
-bash: ./bin/pyspark: No such file or directory
I feel like I'm missing something basic.
What is missing?
I have spark installed and am able to access it using the command:
$ spark-shell
I have python on the node and am able to open python using the command:
$ python
What's your current working directory? The sbt/sbt and ./bin/pyspark commands are relative to the directory containing Spark's code ($SPARK_HOME), so you should be in that directory when running those commands.
Note that Spark offers pre-built binary distributions that are compatible with many common Hadoop distributions; this may be an easier option if you're using one of those distros.
Also, it looks like you linked to the Spark 0.9.0 documentation; if you're building Spark from scratch, I recommend following the latest version of the documentation.
SBT is used to build a Scala project. If you're new to Scala/SBT/Spark, you're doing things the difficult way.
The easiest way to "install" Spark is to simply download Spark (I recommend Spark 1.6.1 -- personal preference). Then unzip the file in the directory you want to have Spark "installed" in, say C:/spark-folder (Windows) or /home/usr/local/spark-folder (Ubuntu).
After you install it in your desired directory, you need to set your environment variables. How you do this depends on your OS; this step is, however, not strictly necessary to run Spark (i.e. pyspark).
If you do not set your environment variables, or don't know how to, an alternative is simply to go to your directory in a terminal window, cd C:/spark-folder (Windows) or cd /home/usr/local/spark-folder (Ubuntu), and then type
./bin/pyspark
and Spark should run.
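Once the PySpark shell comes up, a quick smoke test could look like this sketch (sc is created for you by the shell, so no setup is needed):

# Inside the PySpark shell; `sc` already exists.
rdd = sc.parallelize(["spark is installed", "pyspark works"])
print(rdd.flatMap(lambda line: line.split()).count())  # should print 5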