How to use Spark with Python or Jupyter Notebook

I am trying to work with 12GB of data in Python, for which I desperately need to use Spark, but I can't manage the command line on my own or with help from the internet, which is why I'm turning to SO.
So far I have downloaded Spark and unzipped the tar file (sorry for the vague language, but I'm feeling lost), and now I don't know where to go next. The Spark documentation says:
Spark also provides a Python API. To run Spark interactively in a Python interpreter, use bin/pyspark
but where do I run this? Please help.
Edit: I am using Windows 10.
Note: I have always faced problems when trying to install things, mainly because I can't seem to understand the command prompt.

If you are more familiar with Jupyter Notebook, you can install Apache Toree, which integrates the PySpark, Scala, SQL and SparkR kernels with Spark.
To install Toree:
pip install toree
jupyter toree install --spark_home=path/to/your/spark_directory --interpreters=PySpark
If you want to install the other kernels as well, you can use:
jupyter toree install --interpreters=SparkR,SQL,Scala
Now run:
jupyter notebook
In the UI, when creating a new notebook, you should see the following kernels available:
Apache Toree - PySpark
Apache Toree - SparkR
Apache Toree - SQL
Apache Toree - Scala
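As a quick check that the kernel is wired up, you can run a cell like the following (a minimal sketch; the Toree PySpark kernel normally pre-creates a SparkContext named sc):
# run in an Apache Toree - PySpark notebook cell
print(sc.version)                           # Spark version the kernel is bound to
print(sc.parallelize(range(100)).count())   # should print 100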

When you unzip the file, a directory is created.
Open a terminal.
Navigate to that directory with cd.
Do an ls. You will see its contents; a bin directory should be in there.
Execute bin/pyspark or maybe ./bin/pyspark.
Of course, in practice it's not that simple; you may need to set some paths, as described for example in the TutorialsPoint guide, but there are plenty of such walkthroughs out there.
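If the shell does not come up, a quick sanity check (assuming a standard unpacked Spark layout) is
bin/pyspark --version
which prints the Spark build information without starting the interactive shell.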

I understand that you have already installed Spark on Windows 10.
You will also need winutils.exe. If you haven't already done so, download the file from http://public-repo-1.hortonworks.com/hdp-win-alpha/winutils.exe and install it at, say, C:\winutils\bin
Set up the following environment variables:
HADOOP_HOME=C:\winutils
SPARK_HOME=C:\spark (or wherever you unpacked Spark)
PYSPARK_DRIVER_PYTHON=jupyter (or ipython)
PYSPARK_DRIVER_PYTHON_OPTS=notebook
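One way to set these from a command prompt is with setx (a sketch; setx only affects new sessions, so open a fresh prompt afterwards):
setx HADOOP_HOME C:\winutils
setx SPARK_HOME C:\spark
setx PYSPARK_DRIVER_PYTHON jupyter
setx PYSPARK_DRIVER_PYTHON_OPTS notebook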
Now navigate to the C:\spark directory in a command prompt and type pyspark.
A Jupyter notebook will launch in a browser.
Create a Spark context and run a count command, as in the sketch below.
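A minimal sketch of that last step (the file path here is hypothetical; substitute your own data):
from pyspark import SparkContext

sc = SparkContext("local[*]", "test")            # local Spark context using all cores
lines = sc.textFile("C:/path/to/somefile.txt")   # hypothetical input file
print(lines.count())                             # number of lines in the file
sc.stop()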

Related

Can I work locally in VS Code with a virtual environment while ssh'ed to Google Colab

I want to know if it's possible to work in a virtual environment while ssh'ed into Google Colab. I managed to ssh into Colab, but when I went to code in a .ipynb file I needed to select a kernel, and when I tried selecting the one from my conda virtual environment nothing happened. I'd like to know whether this is possible or whether I did something wrong. If you know of a guide or video that teaches how to do this, please link it; I've already searched but found nothing. Thanks.
After ssh'ing into Colab I tried selecting my conda env kernel with the VS Code kernel button, but it did nothing, so I couldn't run my .ipynb file.
I'm not sure if it helps, but if you want to work on a running Jupyter server from VS Code, you need to connect to that remote Jupyter server.
Here's the link to the docs: https://code.visualstudio.com/docs/datascience/jupyter-notebooks#_connect-to-a-remote-jupyter-server
I'm also not sure whether Colab supports the plugins VS Code needs to run Jupyter remotely, but you should definitely try installing the Jupyter extension first.
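For reference, the remote half of that setup usually means starting a Jupyter server that VS Code can reach, then pasting its URL (including the token) into the "Specify Local or Remote Jupyter server for connections" prompt. A sketch of the server command, using standard Jupyter options:
jupyter notebook --no-browser --ip=0.0.0.0 --port=8888
The startup log prints a URL containing ?token=..., which is what VS Code asks for.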

Jupyter Book build fails after default create

I'm still learning how this all works, so please bear with me.
I'm running conda 4.8.5 on my Windows 10 machine. I've already installed all necessary Jupyter extensions, I think (Jupyter Lab, Jupyter Notebook, Jupyter Book, Node.js, and their dependencies).
The problem might have to do with the fact that I've installed Miniconda on a separate (D:/) drive.
I've set up a virtual environment (MyEnv) with all the packages I might need for this project. These are the steps I follow:
Launch CMD window
$ conda activate MyEnv
$ jupyter-lab --notebook-dir "Documents/Jupyter Books"
At this point a browser tab opens running Jupyter Lab
From the launcher within Jupyter Lab, open a terminal
$ cd "Documents/Jupyter Books"
$ jb create MyCoolBook
New folder with template book contents gets created in this directory (Yay!)
Without editing anything: $ jb build MyCoolBook
A folder gets added to MyCoolBook called _build, but it doesn't contain much more than a few CSS files.
The terminal throws an error traceback that wasn't very helpful to me; the issue may be obvious to an experienced user.
I am not sure how to proceed. I've reset the entire environment a few times trying to get this to work. What do you suggest? I'm considering submitting a bug report but I want to rule out the very reasonable possibility that I'm being silly.
I asked around on the GitHub page/forum for Jupyter Book. It turns out it's a matter of text encoding on Windows (I could have avoided this by reading deeper into the documentation).
If anyone runs across this issue, know that it can be fixed by reverting to a Python 3.7.* release and setting an environment variable (PYTHONUTF8=1), but I wouldn't recommend this because some other packages might require the default system encoding. Instead, follow the instructions in this section of the documentation.
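For completeness, the stopgap described above would look something like this with conda's per-environment variables (a sketch; as said, not the recommended route):
conda activate MyEnv
conda env config vars set PYTHONUTF8=1
conda deactivate
conda activate MyEnv
The variable then applies only inside MyEnv rather than system-wide.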

How to run Python and Jupyter with same virtual env, working with Visual Studio Code

For my current job it would be extremely helpful to be able to configure a virtualenv with the appropriate library versions, and to be able to run either a Python project or cells in Jupyter. Some people at my job work with Jupyter, some with plain Python, and sometimes both, so this way I would have one centralized program that could run both, which I have not found outside of the paid version of PyCharm, which my company does not provide.
I just learned a few days ago about the Windows Subsystem for Linux (WSL), and that it can be used from within Visual Studio Code, so I feel this is my best bet to achieve that dual kind of programming from just ONE program instead of running several as in the past.
As of right now, I have a repository cloned with WSL for a git project with different .py files. I open it with VS Code, then open the terminal inside VS Code, and I can both edit the Python code and run it in the terminal, using bash commands as I would in Ubuntu (I am doing all this from Windows but can switch to Ubuntu if that would make this setup possible).
When I run with "Run Python file in terminal", it uses the virtualenv I have previously created.
The problem is that with Jupyter, it does not detect that I have the libraries installed (pandas, for example).
Description of my process with Jupyter so far: from the WSL console I launch jupyter notebook &. I then connect to that server using the VS Code option "Specify Local or Remote Jupyter server for connections", choose the "Existing" option, copy the URL, then go to the .ipynb file and start running code.
If, in a Jupyter cell, I do
import os
os.environ['VIRTUAL_ENV']
I can see my virtual environment. If right after that I run import pandas, I get ModuleNotFoundError: No module named 'pandas'.
If I do !pip freeze I can see all the libraries, at the right versions, that I have installed in that environment.
I feel like I am almost there but something is missing. My guess is that import might be resolving to some default installed Python, not the one from the environment, for a reason I am missing.
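One quick way to confirm that guess is a diagnostic cell like this (a small sketch):
import sys
print(sys.executable)   # the interpreter the kernel is actually running on
!which python           # the interpreter the shell (and !pip freeze) sees
If the two paths differ, the kernel is not running inside the virtualenv.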
Solved by doing what this answer suggests; the other answers there may help too: https://stackoverflow.com/a/51036073/6028947
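In case the link rots: the usual fix for this symptom is to register the virtualenv as its own Jupyter kernel, roughly like this (a sketch, run with the virtualenv active; the kernel name is arbitrary):
pip install ipykernel
python -m ipykernel install --user --name=MyEnv --display-name "Python (MyEnv)"
After restarting the Jupyter server, a "Python (MyEnv)" kernel appears, and it imports packages from the virtualenv.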

Running pyspark in (Anaconda - Spyder) in windows OS

Dears,
I am using Windows 10 and I am used to testing my Python code in Spyder.
However, when I try the import pyspark command, Spyder shows "No module named 'pyspark'".
Pyspark is installed on my PC, and I can import pyspark in the command prompt without any error.
I found many blogs explaining how to do this in Ubuntu, but I did not find how to solve it in Windows.
Well, to use packages in Spyder, you have to install them through Anaconda. You can open the Anaconda prompt and run the command below:
conda install pyspark
That will make the package available in Spyder.
Hi, I installed Pyspark on Windows 10 a few weeks back. Let me tell you how I did it.
I followed https://changhsinlee.com/install-pyspark-windows-jupyter/.
After following each step precisely, you should be able to run pyspark either from the command prompt or by saving a Python file and running it.
To run via a notebook, download Anaconda, start the Anaconda shell and type pyspark; you then don't need a separate import pyspark step.
Run your program without it and it will be alright. You can also use spark-submit, but for that I found you need to remove the PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS environment variables.
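An alternative that often fixes "No module named 'pyspark'" in IDEs like Spyder is the findspark helper, which locates the Spark installation and adds it to sys.path (a sketch; it assumes SPARK_HOME is set or Spark sits in a default location):
# pip install findspark
import findspark
findspark.init()   # adds pyspark to sys.path using SPARK_HOME

import pyspark
sc = pyspark.SparkContext(appName="spyder-test")
print(sc.parallelize([1, 2, 3]).count())   # 3
sc.stop()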

How to specify python3 kernel in jupyter in pyCharm?

(Screenshots of my interpreter settings and my script are omitted here.)
I am trying to use Jupyter notebook in PyCharm, but it keeps using Python 2 instead of Python 3.
Any idea about this problem?
Added: (screenshot of the notebook running Jupyter in Chrome, omitted here.)
My problem was that I had multiple kernels, and PyCharm launches the default kernel. One approach might be to configure PyCharm to start the kernel of your choice; I didn't investigate how to do that. I simply changed the default kernel in Jupyter, and this worked for me (I have a virtualenv for tensorflow): c.MultiKernelManager.default_kernel_name = 'tensorflow'
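If it's unclear where that line lives: it goes in the Jupyter notebook configuration file (a sketch; generate the file first if it doesn't exist):
jupyter notebook --generate-config
Then, in ~/.jupyter/jupyter_notebook_config.py:
c.MultiKernelManager.default_kernel_name = 'tensorflow'   # a name from "jupyter kernelspec list"
Restart the notebook server for the change to take effect.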
The preferences image you show is indeed how you would set up your interpreter for PyCharm, but that's not what PyCharm's output/logging looks like. I'm guessing that's a jupyter-notebook display, which means you are running into the issue in jupyter-notebook, not PyCharm. So you need to change your setup for Jupyter. Based on some quick searching, pip install jupyter will install a Python 2.7 version of Jupyter. Sounds like what you want is
pip3 install jupyter
which will install the Python 3 version for you. You will likely have to uninstall your current version of Jupyter first.
When you kick off Jupyter-notebook from within PyCharm, a run configuration is created. If that configuration was initially set to 2.7 (I think it defaults to the current project interpreter) and you keep using the same configuration, the state of the current project interpreter wouldn't matter, because PyCharm would use the interpreter saved in the run configuration.
You can modify your run configuration via
Run | Run...
Edit Configurations...
Select your Jupyter Notebook run configuration on the left (here it is untitled4)
Make sure the Python interpreter is correct on the right
I was able to start a Jupyter notebook like this and get it to output Python 3. Hope this is what you need.
