Configuring Spark to work with Jupyter Notebook and Anaconda - python

I've spent a few days now trying to make Spark work with my Jupyter Notebook and Anaconda. Here's what my .bash_profile looks like:
PATH="/my/path/to/anaconda3/bin:$PATH"
export JAVA_HOME="/my/path/to/jdk"
export PYTHON_PATH="/my/path/to/anaconda3/bin/python"
export PYSPARK_PYTHON="/my/path/to/anaconda3/bin/python"
export PATH=$PATH:/my/path/to/spark-2.1.0-bin-hadoop2.7/bin
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark
export SPARK_HOME=/my/path/to/spark-2.1.0-bin-hadoop2.7
alias pyspark="pyspark --conf spark.local.dir=/home/puifais --num-executors 30 --driver-memory 128g --executor-memory 6g --packages com.databricks:spark-csv_2.11:1.5.0"
When I type /my/path/to/spark-2.1.0-bin-hadoop2.7/bin/spark-shell, I can launch Spark just fine in my command-line shell, and the sc output is not empty. It seems to work fine.
When I type pyspark, it launches my Jupyter Notebook fine. But when I create a new Python 3 notebook, this error appears:
[IPKernelApp] WARNING | Unknown error in handling PYTHONSTARTUP file /my/path/to/spark-2.1.0-bin-hadoop2.7/python/pyspark/shell.py:
And sc in my Jupyter Notebook is empty.
Can anyone help solve this situation?
Just want to clarify: there is nothing after the colon at the end of the error. I also tried to create my own start-up file following this post, which I quote here so you don't have to go look there:
I created a short initialization script init_spark.py as follows:
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("yarn-client")
sc = SparkContext(conf = conf)
and placed it in the ~/.ipython/profile_default/startup/ directory
When I did this, the error then became:
[IPKernelApp] WARNING | Unknown error in handling PYTHONSTARTUP file /my/path/to/spark-2.1.0-bin-hadoop2.7/python/pyspark/shell.py:
[IPKernelApp] WARNING | Unknown error in handling startup files:

Well, it really gives me pain to see how crappy hacks, like setting PYSPARK_DRIVER_PYTHON=jupyter, have been promoted to "solutions" and tend now to become standard practices, despite the fact that they evidently lead to ugly outcomes, like typing pyspark and ending up with a Jupyter notebook instead of a PySpark shell, plus yet-unseen problems lurking downstream, such as when you try to use spark-submit with the above settings... :(
(Don't get me wrong, it is not your fault and I am not blaming you; I have seen dozens of posts here at SO where this "solution" has been proposed, accepted, and upvoted...).
At the time of writing (Dec 2017), there is one and only one proper way to customize a Jupyter notebook in order to work with other languages (PySpark here), and this is the use of Jupyter kernels.
The first thing to do is run the jupyter kernelspec list command, to list the kernels already available on your machine; here is the result in my case (Ubuntu):
$ jupyter kernelspec list
Available kernels:
python2 /usr/lib/python2.7/site-packages/ipykernel/resources
caffe /usr/local/share/jupyter/kernels/caffe
ir /usr/local/share/jupyter/kernels/ir
pyspark /usr/local/share/jupyter/kernels/pyspark
pyspark2 /usr/local/share/jupyter/kernels/pyspark2
tensorflow /usr/local/share/jupyter/kernels/tensorflow
The first kernel, python2, is the "default" one coming with IPython (there is a great chance of this being the only one present in your system); as for the rest, I have 2 more Python kernels (caffe & tensorflow), an R one (ir), and two PySpark kernels for use with Spark 1.6 and Spark 2.0 respectively.
The entries of the list above are directories, and each one contains one single file, named kernel.json. Let's see the contents of this file for my pyspark2 kernel:
{
  "display_name": "PySpark (Spark 2.0)",
  "language": "python",
  "argv": [
    "/opt/intel/intelpython27/bin/python2",
    "-m",
    "ipykernel",
    "-f",
    "{connection_file}"
  ],
  "env": {
    "SPARK_HOME": "/home/ctsats/spark-2.0.0-bin-hadoop2.6",
    "PYTHONPATH": "/home/ctsats/spark-2.0.0-bin-hadoop2.6/python:/home/ctsats/spark-2.0.0-bin-hadoop2.6/python/lib/py4j-0.10.1-src.zip",
    "PYTHONSTARTUP": "/home/ctsats/spark-2.0.0-bin-hadoop2.6/python/pyspark/shell.py",
    "PYSPARK_PYTHON": "/opt/intel/intelpython27/bin/python2"
  }
}
I have not bothered to change my details to /my/path/to etc., and you can already see that there are some differences between our cases (I use Intel Python 2.7, and not Anaconda Python 3), but hopefully you get the idea (BTW, don't worry about the connection_file - I don't use one either).
Now, the easiest way for you would be to manually make the necessary changes (paths only) to the kernel shown above and save it in a new subfolder of the .../jupyter/kernels directory (that way, it should be visible if you run jupyter kernelspec list again). And if you think this approach is also a hack, well, I would agree with you, but it is the one recommended in the Jupyter documentation (page 12):
However, there isn’t a great way to modify the kernelspecs. One approach uses jupyter kernelspec list to find the kernel.json file and then modifies it, e.g. kernels/python3/kernel.json, by hand.
If you don't already have a .../jupyter/kernels folder, you can still install a new kernel using jupyter kernelspec install - I haven't tried it, but have a look at this SO answer.
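If editing the JSON by hand feels error-prone, the same kernel spec can also be written with a short script. This is only a sketch using the question's placeholder paths; both the kernel directory (~/.local/share/jupyter/kernels/pyspark3) and the py4j zip name are assumptions, so check the actual archive name in $SPARK_HOME/python/lib:
import json, os

# Assumed location for a user-level kernel; adjust to your own setup
kernel_dir = os.path.expanduser("~/.local/share/jupyter/kernels/pyspark3")
os.makedirs(kernel_dir, exist_ok=True)

spark_home = "/my/path/to/spark-2.1.0-bin-hadoop2.7"
python_bin = "/my/path/to/anaconda3/bin/python"

spec = {
    "display_name": "PySpark (Spark 2.1)",
    "language": "python",
    "argv": [python_bin, "-m", "ipykernel", "-f", "{connection_file}"],
    "env": {
        "SPARK_HOME": spark_home,
        # Check the exact py4j version shipped in $SPARK_HOME/python/lib
        "PYTHONPATH": spark_home + "/python:" + spark_home + "/python/lib/py4j-0.10.4-src.zip",
        "PYTHONSTARTUP": spark_home + "/python/pyspark/shell.py",
        "PYSPARK_PYTHON": python_bin,
    },
}

with open(os.path.join(kernel_dir, "kernel.json"), "w") as f:
    json.dump(spec, f, indent=2)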
Finally, don't forget to remove all the PySpark-related environment variables from your bash profile (leaving only SPARK_HOME should be OK). And confirm that, when you type pyspark, you find yourself with a PySpark shell, as it should be, and not with a Jupyter notebook...
UPDATE (after comment): If you want to pass command-line arguments to PySpark, you should add the PYSPARK_SUBMIT_ARGS setting under env; for example, here is the last line of my respective kernel file for Spark 1.6.0, where we still had to use the external spark-csv package for reading CSV files:
"PYSPARK_SUBMIT_ARGS": "--master local --packages com.databricks:spark-csv_2.10:1.4.0 pyspark-shell"

Conda can help correctly manage a lot of dependencies...
Install Spark. Assuming Spark is installed in /opt/spark, include this in your ~/.bashrc:
export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH
Create a conda environment with all needed dependencies apart from spark:
conda create -n findspark-jupyter-openjdk8-py3 -c conda-forge python=3.5 jupyter=1.0 notebook=5.0 openjdk=8.0.144 findspark=1.1.0
Activate the environment
$ source activate findspark-jupyter-openjdk8-py3
Launch a Jupyter Notebook server:
$ jupyter notebook
In your browser, create a new Python 3 notebook.
Try calculating Pi with the following script (borrowed from this):
import findspark
findspark.init()
import pyspark
import random

sc = pyspark.SparkContext(appName="Pi")
num_samples = 100000000

def inside(p):
    x, y = random.random(), random.random()
    return x*x + y*y < 1

count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(pi)
sc.stop()

I just conda installed sparkmagic (after re-installing a newer version of Spark).
I think that alone simply works, and it is much simpler than fiddling with configuration files by hand.
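For what it's worth, sparkmagic connects to Spark through a Livy endpoint rather than a local SPARK_HOME, so it is mostly aimed at remote clusters. A minimal sketch of loading its magics in a plain Python 3 notebook (this assumes a reachable Livy server and is not something the answer above spells out):
# In a regular Python 3 notebook cell, after installing sparkmagic:
%load_ext sparkmagic.magics
# Opens a widget to add and manage Livy-backed Spark sessions
%manage_spark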

Related

Apache Spark's worker python

After installing Apache Spark on 3 nodes on top of Hadoop, I encountered the following problems:
Problem 1 - Python version:
I had a problem setting the Python interpreter on the workers. This is the setting in the .bashrc file, and the same setting is in the spark-env.sh file.
alias python3='/usr/bin/python3'
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3
In the Spark logs (yarn logs --applicationId <app_id>) I could see that everything was as expected:
export USER="hadoop"
export LOGNAME="hadoop"
export PYSPARK_PYTHON="python3"
Although I installed the pandas library (pip install pandas) on the master and worker nodes and made sure it was installed, I constantly received the following message when using the command /home/hadoop/spark/bin/spark-submit --master yarn --deploy-mode cluster sparksql_recommender_system_2.py:
ModuleNotFoundError: No module named 'pandas'
Surprisingly, this error occurred only in cluster mode; I didn't have it in client deploy mode.
The command which python returns /usr/bin/python, where the pandas library exists.
After 2 days I couldn't find an answer on the web. By chance, I tried installing pandas using sudo, and it worked :).
sudo pip install pandas
However, I expected Spark to use the Python at /usr/bin/python as the hadoop user, not as the root user. How can I fix it?
Problem 2 - different behavior of VS Code SSH
I use the VS Code SSH add-on to connect to a server on which I develop my code. When I do it from one host (PC) I can use spark-submit, but on my other PC I have to use the full path /home/hadoop/spark/bin/spark-submit. It is strange because I use VS Code SSH to the same server and files. Any idea how I can solve it?
Here's a great discussion on how to package things up so that your Python environment is transferred to the executors.
Create the environment
conda create -y -n pyspark_conda_env -c conda-forge pyarrow pandas conda-pack
conda activate pyspark_conda_env
conda pack -f -o pyspark_conda_env.tar.gz
Ship it:
export PYSPARK_DRIVER_PYTHON=python # Do not set in cluster modes.
export PYSPARK_PYTHON=./environment/bin/python
spark-submit --archives pyspark_conda_env.tar.gz#environment app.py
This does have the disadvantage that the environment has to be shipped every time, but it's the safest and least-hassle way to do it. Installing everything on each node is 'faster' but comes with a higher management burden, and I suggest avoiding it.
All that said... get off pandas. Pandas does Python things (small data). Spark DataFrames do Spark things (big data). I hope that was just an illustrative example and you aren't going to use pandas. (It's not bad! It's just made for small data, so use it for small data.) If you "have to" use it, look into Koalas, which does a translation layer that lets you ask pandas things of Spark DataFrames.
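To make that concrete, here is a rough sketch of reading a CSV straight into a Spark DataFrame instead of pandas (the file path and column name are made up for illustration):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-example").getOrCreate()

# df is a distributed Spark DataFrame, not a pandas one
df = spark.read.csv("hdfs:///data/ratings.csv", header=True, inferSchema=True)
df.groupBy("user_id").count().show()

# If you really want the pandas API on top of Spark, Koalas (merged into
# Spark 3.2+ as pyspark.pandas) provides it:
# import pyspark.pandas as ps
# psdf = ps.read_csv("hdfs:///data/ratings.csv")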

Can’t create an environment (Conda or pipenv) that works properly with Jupyter

The long and short of it:
One day in July, I noticed that Jupyter wasn't importing the version of Seaborn I had installed into my Conda env. It was loading an older Seaborn from a global dir, and the same was true for all other packages when I checked their versions. After various attempts at fixing this, Jupyter doesn't even import packages now. I've tried with pipenv too.
In both Conda and pipenv, sys.path reveals more path entries than I know what to do with, sometimes including the desired env path, sometimes not. Either way, Jupyter imports ignore the env path I want to use and instead look for global packages that I have deleted since July in an attempt to solve the issue.
On top of this (but probably intertwined in a way I just don't understand yet), I'm not getting kernels that connect Jupyter to the desired env directories where packages are installed. The global Python gets used instead of the env's Python instance. I'm not sure exactly how kernels are created, but I can tell that they are either not being created for some new envs, not accessing env-specific Python & packages, or failing to connect (a stale kernel.json file loads and fails to start).
Desired outcome:
How do I get JupyterLab to import the intended package version from the intended env directories? Even deeper, how do I get environments back to their former functionality of automatically [1] creating their own Python instance, [2] creating their own kernel recognized by Jupyter, [3] creating their own path to the env, and [4] initializing all of that in JupyterLab?
Things I have tried:
Deactivated the base env, which admittedly I wasn't doing for the first few weeks of July until I remembered that it's a must for Conda on Windows… but now that I'm regularly deactivating base, why would these issues still persist?
Uninstalled Anaconda and reinstalled Miniconda
Deleted (I believe) all stray / older pythons from my machine, reinstalled a fresh user-level Python 3.9
Verified that packages are installing with the conda list command. They just won’t import properly in JupyterLab
Made some tweaks to Path variables in Windows settings, but I was very cautious and have no idea whether I am revising those properly. Clearly not, though, seeing as the issue is still alive 3 months later 0.o No idea whether I should be editing user variables or system variables, or how to trim the paths safely.
Reinstalled jupyter and jupyterlab
Reinstalled ipykernel
Noted that sys.executable, sys.path, and !where python give different outputs in shell python versus Jupyter Lab python
Switching kernels manually in JupyterLab (usually the manually selected kernel is DOA)
Tried manually rewriting sys.path in the notebook.
Tried launching JupyterLab from a fresh pipenv instead of Miniconda.
Got a pipenv working correctly for a Streamlit applet in August, using raw Python in VS Code (this may isolate the issue to Jupyter? I had to continue the other Jupyter-based project in Google Colab since JupyterLab started choking in July).
Deeper details:
Sheesh, this post is already getting long… but here I've picked out the 4 most prominent code/error snippets. There are more where those came from, but hopefully there is something in here that you might recognize as the needle in the haystack.
[1]
When the kernel-session.json causes a stillborn kernel, it is created as a totally blank json file. Running jupyter lab in the conda terminal yields this error message, buried in the output:
Failed to load connection file:
'C:\\Users\\David.000\\AppData\\Roaming\\jupyter\\runtime\\kernel-a6082e80-0b65-48f1-b370-7c2918030185.json'
I’ve found past kernel-session.json files that were not stillborn, such as this one called kernel-b146b600-81e3-418e-a55f-5a3fbbc13471.json, which looks like this, auto-populated:
{
  "shell_port": 50877,
  "iopub_port": 50878,
  "stdin_port": 50879,
  "control_port": 50880,
  "hb_port": 50881,
  "ip": "127.0.0.1",
  "key": "caa915a4-00a599e8b6c4db6417bcca77",
  "transport": "tcp",
  "signature_scheme": "hmac-sha256",
  "kernel_name": ""
}
[2]
sys.executable, when run in shell Python, yields the proper Python location within the env:
>>> sys.executable
'C:\\Users\\David.000\\miniconda3\\envs\\aucu_ml\\python.exe'
But sys.executable, when run in Jupyter Lab, is latching onto the global Python:
sys.executable
'C:\\Python39\\python.exe'
[3]
sys.path, when run in shell Python, seems to yield the env-specific paths I want (though why are there so many??)
>>> sys.path
['', 'C:\\Users\\David.000\\miniconda3\\envs\\aucu_ml\\python39.zip',
'C:\\Users\\David.000\\miniconda3\\envs\\aucu_ml\\DLLs',
'C:\\Users\\David.000\\miniconda3\\envs\\aucu_ml\\lib',
'C:\\Users\\David.000\\miniconda3\\envs\\aucu_ml',
'C:\\Users\\David.000\\AppData\\Roaming\\Python\\Python39\\site-packages',
'C:\\Users\\David.000\\miniconda3\\envs\\aucu_ml\\lib\\site-packages',
'C:\\Users\\David.000\\miniconda3\\envs\\aucu_ml\\lib\\site-packages\\win32',
'C:\\Users\\David.000\\miniconda3\\envs\\aucu_ml\\lib\\site-packages\\win32\\lib',
'C:\\Users\\David.000\\miniconda3\\envs\\aucu_ml\\lib\\site-packages\\Pythonwin']
But sys.path, when run in Jupyter Lab, is also showing a butt-load of paths, many of which are unwanted globals:
sys.path
['C:\\Users\\David.000\\Desktop\\Civic Innovation Corps\\Miami\\Predictive_Analytics_Business_Licensing',
'C:\\Python39\\python39.zip',
'C:\\Python39\\DLLs',
'C:\\Python39\\lib',
'C:\\Python39',
'',
'C:\\Users\\David.000\\AppData\\Roaming\\Python\\Python39\\site-packages',
'C:\\Python39\\lib\\site-packages',
'C:\\Python39\\lib\\site-packages\\win32',
'C:\\Python39\\lib\\site-packages\\win32\\lib',
'C:\\Python39\\lib\\site-packages\\Pythonwin',
'C:\\Python39\\lib\\site-packages\\IPython\\extensions',
'C:\\Users\\David.000\\.ipython']
I'm probably not qualified to answer this but what the heck.
I have several kernels associated with different environments that are configured to run different programs. I created them in the terminal, and I make it a practice to install any module or program within the terminal, in the activated environment that I need it in, instead of installing from within the notebook.
If I'm using a desktop, then before opening Jupyter I select the appropriate environment within Anaconda. If I'm using a virtual machine, then I activate the environment in the terminal and, once I'm in the notebook, select the appropriate kernel.
While I can't be sure, I think your problem is occurring because of how/where you're installing or updating packages.
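One concrete way to create such per-environment kernels (my suggestion, not something spelled out in this answer) is to register the active environment's interpreter with ipykernel; the kernel name and display name below are just examples:
# Run with the environment's own interpreter (e.g. after `conda activate aucu_ml`)
import subprocess, sys

subprocess.check_call([
    sys.executable, "-m", "ipykernel", "install",
    "--user", "--name", "aucu_ml", "--display-name", "Python (aucu_ml)",
])
# The new kernel should then appear in JupyterLab's kernel picker,
# and sys.executable inside it should point at the env's python.exe.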

Jupyter Book build fails after default create

I'm still learning how this all works, so please bear with me.
I'm running conda 4.8.5 on my Windows 10 machine. I've already installed all necessary Jupyter extensions, I think (Jupyter Lab, Jupyter Notebook, Jupyter Book, Node.js, and their dependencies).
The problem might have to do with the fact that I've installed Miniconda on a separate (D:/) drive.
I've set up a virtual environment (MyEnv) with all the packages I might need for this project. These are the steps I follow:
Launch CMD window
$ conda activate MyEnv
$ jupyter-lab --notebook-dir "Documents/Jupyter Books"
At this point a browser tab opens running Jupyter Lab
From the launcher within Jupyter Lab, open a terminal
$ cd "Documents/Jupyter Books"
$ jb create MyCoolBook
New folder with template book contents gets created in this directory (Yay!)
Without editing anything: $ jb build MyCoolBook
A folder gets added to MyCoolBook called _build, but it doesn't contain much more than a few CSS files.
The terminal throws an error traceback which wasn't very helpful to me. The issue may be obvious to an experienced user.
I am not sure how to proceed. I've reset the entire environment a few times trying to get this to work. What do you suggest? I'm considering submitting a bug report but I want to rule out the very reasonable possibility that I'm being silly.
I asked around in the GitHub page/forum for Jupyter Book. It turns out it's a matter of text encoding in Windows (I could have avoided this by reading deep into the documentation).
If anyone runs across this issue, just know that it can be fixed by reverting to an earlier release (Python 3.7.*) and setting an environment variable (PYTHONUTF8=1), but this is not something I would recommend, because some other packages might require the default system encoding. Instead, follow the instructions in this section of the documentation.

JUPYTER_PATH in environment variables not working

I am trying to update JUPYTER_PATH for Jupyter Notebook. I set the environment variables following the Jupyter documentation, but jupyter contrib nbextension install --user, for example, still installed under C:\Users\username\AppData\nbextensions instead of C:\somedir\AppData\Roaming\jupyter\nbextensions.
I added these to my environment variables:
JUPYTER_CONFIG_DIR=C:\somedir\.jupyter
JUPYTER_PATH=C:\somedir\AppData\Roaming\jupyter
JUPYTER_RUNTIME_DIR=C:\somedir\AppData\Roaming\jupyter
jupyter --path shows
PS C:\somedir\> jupyter --path
config:
C:\somedir\.jupyter
C:\anaconda\python27\win64\431\etc\jupyter
C:\ProgramData\jupyter
data:
C:\somedir\AppData\Roaming\jupyter
C:\Users\username\AppData\Roaming\jupyter
C:\anaconda\python27\win64\431\share\jupyter
C:\ProgramData\jupyter
runtime:
C:\somedir\AppData\Roaming\jupyter
jupyter --data-dir shows
jupyter --data-dir
C:\Users\username\AppData\Roaming\jupyter
I think C:\Users\username\AppData\Roaming\jupyter needs to be removed, but I'm not sure how. Can you please help?
To set the user data directory, you should instead use the JUPYTER_DATA_DIR environment variable, in your case set to C:\somedir\AppData\Roaming\jupyter. You can also unset JUPYTER_PATH (see below for details).
Although it's not terribly obvious from the documentation, the nbextension install command takes no notice of the JUPYTER_PATH environment variable, since it doesn't use the jupyter_core.paths.jupyter_path function, but uses jupyter_core.paths.jupyter_data_dir to construct the user-data nbextensions directory directly.
The entry C:\Users\username\AppData\Roaming\jupyter from the data section of the output of jupyter --paths is the user data directory, since JUPYTER_PATH is used in addition to other entries, rather than replacing any. For your purposes, I suggest you unset JUPYTER_PATH, since you can get what you want without it.
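A quick way to see what those functions resolve to on your machine is to call them directly from the same shell where the environment variables are set (the values in the comments are just what you would hope to see in this case):
from jupyter_core.paths import jupyter_data_dir, jupyter_path

print(jupyter_data_dir())  # with JUPYTER_DATA_DIR set: C:\somedir\AppData\Roaming\jupyter
print(jupyter_path())      # JUPYTER_PATH entries are added to this list, not substituted for it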

how to use spark with python or jupyter notebook

I am trying to work with 12 GB of data in Python, for which I desperately need to use Spark, but I guess I'm too stupid to use the command line by myself or by using the internet, and that is why I guess I have to turn to SO.
So far I have downloaded Spark and unzipped the tar file, or whatever that is (sorry for the language, but I am feeling stupid and lost), but now I can see nowhere to go. I have seen the instructions in the Spark website documentation, and they say:
Spark also provides a Python API. To run Spark interactively in a Python interpreter, use bin/pyspark - but where do I do this? Please, please help.
Edit: I am using Windows 10.
Note: I have always faced problems when trying to install something, mainly because I can't seem to understand the command prompt.
If you are more familiar with Jupyter Notebook, you can install Apache Toree, which integrates PySpark, Scala, SQL, and SparkR kernels with Spark.
To install Toree:
pip install toree
jupyter toree install --spark_home=path/to/your/spark_directory --interpreters=PySpark
If you want to install the other kernels, you can use:
jupyter toree install --interpreters=SparkR,SQL,Scala
Now run
jupyter notebook
In the UI, when selecting a new notebook, you should see the following kernels available:
Apache Toree-Pyspark
Apache Toree-SparkR
Apache Toree-SQL
Apache Toree-Scala
When you unzip the file, a directory is created.
Open a terminal.
Navigate to that directory with cd.
Do an ls. You will see its contents; bin must be in there somewhere.
Execute bin/pyspark, or maybe ./bin/pyspark.
Of course, in practice it's not that simple; you may need to set some paths, as described in TutorialsPoint, but there are plenty of such links out there.
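If setting paths by hand is the sticking point, one workaround (my suggestion, not part of the original answer) is to point findspark at the unzipped directory from inside Python; the path below is just an example:
# pip install findspark first
import findspark
findspark.init("/home/me/spark-2.1.0-bin-hadoop2.7")  # or the unzipped folder on Windows, e.g. C:\spark

import pyspark
sc = pyspark.SparkContext.getOrCreate()
print(sc.version)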
I understand that you have already installed Spark on Windows 10.
You will need to have winutils.exe available as well. If you haven't already done so, download the file from http://public-repo-1.hortonworks.com/hdp-win-alpha/winutils.exe and install it at, say, C:\winutils\bin.
Set up the environment variables:
HADOOP_HOME=C:\winutils
SPARK_HOME=C:\spark or wherever.
PYSPARK_DRIVER_PYTHON=ipython or jupyter notebook
PYSPARK_DRIVER_PYTHON_OPTS=notebook
Now navigate to the C:\Spark directory in a command prompt and type "pyspark"
Jupyter notebook will launch in a browser.
Create a Spark context and run a count command, for example:
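Something along these lines (a minimal illustration; the data is made up, and in this setup sc is already created for you by pyspark/shell.py):
# sc already exists when the notebook is launched via pyspark;
# getOrCreate() just makes this cell safe to re-run
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
print(sc.parallelize([1, 2, 3, 4, 5]).count())  # 5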
