How to run pySpark on Hadoop - python

I am new to the Hadoop world.
I am going to install a standalone version of Hadoop on my PC to save files on HDFS (just 1 node, of course) and then run PySpark to read files from HDFS and process them. I have no clue how to put these pieces together.
Can anyone please give me a crystal-clear order of the components I need to install?

If you are using a Windows PC, then you have to install VMware Player or Oracle VirtualBox, and then:
1.a. install any Linux distribution (e.g. CentOS, RHEL, Ubuntu) in your VM
1.b. install Java in your VM
1.c. follow from step 2.b onwards
If you are using a Linux machine, then:
2.a. install Java
2.b. download a stable version of Apache Hadoop
2.c. extract the tar file into /usr/your/directory
2.d. add your Hadoop paths to ~/.bash_profile, e.g.
export HADOOP_HOME=/opt/hadoop
export HADOOP_COMMON_HOME=$HADOOP_HOME
2.e. make the configurations in core-site.xml, hdfs-site.xml, mapred-site.xml and yarn-site.xml, setting the must-have properties for each of those files
2.f. format your NameNode and then start the rest of the daemons
NOTE: follow the steps given in a single-node cluster installation guide, or the Apache documentation.
After installing and configuring Hadoop on your PC:
3.a. download Apache Spark
3.b. extract the tar file and follow the same instructions as before, exporting the Spark path in your .bash_profile
3.c. start the spark-shell or pyspark shell (see the verification sketch below)
NOTE: follow the steps for installing Spark.
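As a quick sanity check once HDFS and the Spark daemons are up, a minimal PySpark sketch like the one below can read a file back from HDFS. The host/port (localhost:9000) and the file path are assumptions based on a default single-node setup; adjust them to match fs.defaultFS in your core-site.xml.

from pyspark.sql import SparkSession

# Assumes a default single-node HDFS at hdfs://localhost:9000 (check fs.defaultFS
# in your core-site.xml) and that the file was uploaded first, e.g. with
# `hdfs dfs -put sample.txt /user/yourname/`. Both path and port are assumptions.
spark = SparkSession.builder.appName("hdfs-smoke-test").getOrCreate()

lines = spark.read.text("hdfs://localhost:9000/user/yourname/sample.txt")
print(lines.count())           # number of lines read back from HDFS
lines.show(5, truncate=False)  # peek at the first few lines

spark.stop()

If this prints a line count, HDFS, Spark, and their configuration are wired together correctly.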

If you have Windows 10 Pro then you can install Ubuntu 20.04 on WSL (https://learn.microsoft.com/en-us/windows/wsl/install-win10#manual-installation-steps) and then install Hadoop (https://dev.to/samujjwaal/hadoop-installation-on-windows-10-using-wsl-2ck1).
At that point you have two parts: one that manages the HDFS files and one that manages the applications.
So you can use the HDFS management side to store/host any massive file, which can be copied over in compressed form and then decompressed inside the Hadoop storage area.
Then you can install Python (PySpark), Java, or another language to create an app that runs under the Hadoop application management side.
The Hadoop application (say, PySpark) will then be able to access any massive file stored in that Hadoop storage area (a small sketch follows after the tutorial links below).
For installing Apache Spark, as an alternative you can proceed as in this tutorial:
https://dev.to/awwsmm/installing-and-running-hadoop-and-spark-on-ubuntu-18-393h
or
https://kontext.tech/column/spark/311/apache-spark-243-installation-on-windows-10-using-windows-subsystem-for-linux
Remark: note that the tutorial above installs the Hadoop 3.3.0 version (check it with the Ubuntu command $ hadoop version), so in the Spark installation tutorials you must choose/use a matching Spark version!
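To make the storage/application split above concrete, here is a small sketch of a PySpark application that runs under YARN (the application-management side) while reading a file from HDFS (the storage side). The master setting, the HADOOP_CONF_DIR requirement, and the file path are assumptions about a default setup, not something the tutorials above prescribe.

from pyspark.sql import SparkSession

# Assumes HADOOP_CONF_DIR (or YARN_CONF_DIR) points at your Hadoop configuration
# directory so Spark can find both YARN and HDFS, and that the file below was
# already copied into HDFS. The path and app name are illustrative only.
spark = (SparkSession.builder
         .appName("yarn-example")
         .master("yarn")   # let YARN, the application-management side, schedule the job
         .getOrCreate())

# The path is resolved against fs.defaultFS from core-site.xml, i.e. the HDFS side.
big_file = spark.read.text("/data/massive_file.txt")
print(big_file.count())

spark.stop()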

Related

PySpark 2.4 is not starting

I have been using PySpark 2.4 for some time. Spark was installed on my system at /usr/local/spark. Suddenly I see this error when I type pyspark at the command line.
Failed to find Spark jars directory (/assembly/target/scala-2.12/jars).
You need to build Spark with the target "package" before running this program.
System OS = CentOS 7
I have Python 2.7 and Python 3.6 installed. However, when I installed Spark, I set Python 3.6 as the default.
There are a couple of cron jobs that run every day (and have been running for quite some time) that use pyspark. I am afraid an error like the one above may prevent those crons from running.
Please shed some light on this.

No module named 'resource' installing Apache Spark on Windows

I am trying to install Apache Spark to run locally on my Windows machine. I have followed all the instructions here: https://medium.com/@loldja/installing-apache-spark-pyspark-the-missing-quick-start-guide-for-windows-ad81702ba62d.
After this installation I am able to successfully start pyspark, and execute a command such as
textFile = sc.textFile("README.md")
When I then execute a command that operates on textFile such as
textFile.first()
Spark gives me the error 'worker failed to connect back', and I can see an exception in the console coming from worker.py saying 'ModuleNotFoundError: No module named resource'. Looking at the source file, I can see that this Python file does indeed try to import the resource module, but that module is not available on Windows systems. I understand that you can install Spark on Windows, so how do I get around this?
I struggled the whole morning with the same problem. Your best bet is to downgrade to Spark 2.3.2.
The fix can be found at https://github.com/apache/spark/pull/23055.
The resource module is only for Unix/Linux systems and is not applicable in a Windows environment. This fix is not yet included in the latest release, but you can modify the worker.py in your installation as shown in the pull request. The changes to that file can be found at https://github.com/apache/spark/pull/23055/files.
You will then have to re-zip the pyspark directory and move it to the lib folder in your PySpark installation directory (where you extracted the pre-compiled PySpark according to the tutorial you mentioned).
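If you go the manual-patch route, the re-zip step can be scripted. A rough sketch, assuming the standard pre-built Spark layout where the archive lives at python/lib/pyspark.zip; the SPARK_HOME path below is only a placeholder, not a value from the original answer.

import shutil

# Assumed location of your extracted Spark distribution; adjust as needed.
SPARK_HOME = r"C:\spark-2.4.0-bin-hadoop2.7"

# Rebuild python/lib/pyspark.zip from the patched pyspark package so the
# executors pick up the modified worker.py.
shutil.make_archive(
    base_name=f"{SPARK_HOME}/python/lib/pyspark",  # ".zip" is appended automatically
    format="zip",
    root_dir=f"{SPARK_HOME}/python",
    base_dir="pyspark",
)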
Adding to all those valuable answers,
For Windows users: make sure you have copied the correct version of the winutils.exe file (for your specific version of Hadoop) to the spark/bin folder.
Say,
if you have Hadoop 2.7.1, then you should copy the winutils.exe file from the hadoop-2.7.1/bin folder.
The link for that is here:
https://github.com/steveloughran/winutils
I edited the worker.py file and removed all the resource-related lines, namely the '# set up memory limits' block and the 'import resource' statement. The error disappeared.
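Rather than deleting those lines outright, a platform guard keeps the behaviour on Linux/macOS intact. This is only a rough sketch of the idea, not the exact upstream patch (see the pull request linked above for the real change), and the helper name set_memory_limit is purely illustrative.

import sys

# 'resource' exists only on Unix-like systems, so guard the import instead of
# deleting it; this keeps the original behaviour on Linux/macOS.
has_resource_module = sys.platform != "win32"
if has_resource_module:
    import resource

def set_memory_limit(limit_bytes):
    # Hypothetical helper for illustration: skip the memory-limit setup
    # on Windows instead of crashing with ModuleNotFoundError.
    if not has_resource_module:
        return
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, hard))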

PySpark Will not start - ‘python’: No such file or directory

I'm trying to setup pyspark on my desktop and interact with it via the terminal.
I'm following this guide,
http://jmedium.com/pyspark-in-python/
When I run 'pyspark' in the terminal it says,
/home/jacob/spark-2.1.0-bin-hadoop2.7/bin/pyspark: line 45: python:
command not found
env: ‘python’: No such file or directory
I've followed several guides, which all lead to this same issue (some have different details on setting up the .profile; thus far none have worked correctly).
I have java, python3.6, and Scala installed.
My .profile is configured as follows:
#Spark and PySpark Setup
PATH="$HOME/bin:$HOME/.local/bin:$PATH"
export SPARK_HOME='/home/jacob/spark-2.1.0-bin-hadoop2.7'
export PATH=$SPARK_HOME:$PATH
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
#export PYSPARK_DRIVER_PYTHON="jupyter"
#export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
export PYSPARK_PYTHON=python3.6.5
Note that the Jupyter notebook lines are commented out because I want to launch pyspark in the shell right now, without the notebook starting.
Interestingly, spark-shell launches just fine.
I'm using Ubuntu 18.04.1 and Spark 2.1.
See the images below.
I've tried every guide I can find, and since this is my first time setting up Spark i'm not sure how to troubleshoot it from here
Thank you
(Screenshots: attempting to execute pyspark, .profile, versions)
You should have set export PYSPARK_PYTHON=python3 instead of export PYSPARK_PYTHON=python3.6.5 in your .profile,
then source .profile, of course.
That worked for me.
As for other options: installing Python via sudo apt install python (which gives you 2.x) is not appropriate.
For those who may come across this, I figured it out!
I specifically chose to use an older version of Spark in order to follow along with a tutorial I was watching - Spark 2.1.0. I did not know that the latest version of Python (3.5.6 at the time of writing this) is incompatible with Spark 2.1. Thus PySpark would not launch.
I solved this by using Python 2.7 and setting the path accordingly in .bashrc
export PYTHONPATH=$PYTHONPATH:/usr/lib/python2.7
export PYSPARK_PYTHON=python2.7
People using Python 3.8 and Spark <= 2.4.5 will have the same problem.
In this case, the only solution I found was to update Spark to version 3.0.0.
Look at https://bugs.python.org/issue38775
For GNU/Linux users who have the python3 package installed (especially on Ubuntu/Debian distros), there is a package called "python-is-python3" that makes the python command point to python3:
# apt install python-is-python3
Python 2.7 is deprecated now (as of 2020 / Ubuntu 20.10), so do not try installing it.
I have already solved this issue. Just type this command:
sudo apt install python

run pyspark locally

I tried to follow the instructions in this book:
Large Scale Machine Learning with Python
It uses a VM image to run Spark involving Oracle VM VirtualBox and Vagrant. I almost managed to get the VM working but am blocked by not having the permission to switch virtualization on in the BIOS (I do not have the password and doubt my employer's IT department will allow me to switch this on). See also discussion here.
Anyway, what other options do I have to play with PySpark locally (I have installed it locally)? My first aim is to get this Scala code:
scala> val file = sc.textFile("C:\\war_and_peace.txt")
scala> val warsCount = file.filter(line => line.contains("war"))
scala> val peaceCount = file.filter(line => line.contains("peace"))
scala> warsCount.count()
res0: Long = 1218
scala> peaceCount.count()
res1: Long = 128
running in Python. Any pointers would be very much appreciated.
So you can set up Spark with the Python and Scala shells on Windows, but the caveat is that in my experience, performance on Windows is inferior to that on OS X and Linux. If you want to go the route of setting everything up on Windows, I made a short write-up of the instructions for this not too long ago that you can check out here. I am pasting the text below just in case I ever move the file from that repo or the link breaks for some other reason.
Download and Extract Spark
Download latest release of spark from apache.
Be aware that it is critical that you get the right Hadoop binaries for the version of spark you choose. See section on Hadoop binaries below.
Extract with 7-zip.
Install Java and Python
Install latest version of 64-bit Java.
Install Anaconda3 Python 3.5, 64-bit (or other version of your choice) for all users. Restart server.
Test Java and Python
Open command line and type java -version. If it is installed properly you will see an output like this:
java version "1.8.0_121"
Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
Type either python or python --version.
The first will open the python shell after showing the version information. The second will show only version information similar to this:
Python 3.5.2 :: Anaconda 4.2.0 (64-bit)
Download Hadoop binary for Windows 64-bit
You likely don't have Hadoop installed on Windows, but deep within its core Spark will look for this file and possibly other binaries. Thankfully a Hadoop contributor has compiled these and has a repository with binaries for Hadoop 2.6. These binaries will work for Spark version 2.0.2, but will not work with 2.1.0. To use Spark 2.1.0, download the binaries from here.
The best tactic for this is to clone the repo, keep the Hadoop folder corresponding to your version of Spark, set HADOOP_HOME to that hadoop-%version% folder, and add it to your path.
Add Java and Spark to Environment
Add the path to java and spark as environment variables JAVA_HOME and SPARK_HOME respectively.
Test pyspark
At the command line, type pyspark and observe the output. At this point Spark should start in the Python shell.
Setup pyspark to use Jupyter notebook
Instructions for using interactive Python shells with pyspark exist within the pyspark code and can be accessed through your editor. In order to use the Jupyter notebook, type the following two commands before launching pyspark:
set PYSPARK_DRIVER_PYTHON=jupyter
set PYSPARK_DRIVER_PYTHON_OPTS='notebook'
Once those variables are set, pyspark will launch in the Jupyter notebook with the default SparkContext initialized as sc and the SparkSession initialized as spark. ProTip: open http://127.0.0.1:4040 to view the Spark UI, which includes lots of useful information about your pipeline and completed processes. Any additional notebooks opened with Spark running will be on consecutive ports, i.e. 4041, 4042, etc.
The gist is that getting the right versions of the Hadoop binary files for your version of Spark is critical. The rest is making sure your path and environment variables are properly configured.
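Once pyspark starts, the Scala snippet from the question translates almost line for line into Python. A minimal sketch, assuming the same local war_and_peace.txt path as in the question (the counts will depend on your copy of the file):

from pyspark.sql import SparkSession

# Runs locally, no Hadoop cluster required. The file path is the one from the
# question; the resulting counts depend on your copy of the file.
spark = SparkSession.builder.master("local[*]").appName("war-and-peace").getOrCreate()
sc = spark.sparkContext

file = sc.textFile("C:\\war_and_peace.txt")
wars_count = file.filter(lambda line: "war" in line)
peace_count = file.filter(lambda line: "peace" in line)

print(wars_count.count())
print(peace_count.count())

spark.stop()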

Why do Python scripts on HDInsight fail with 'No module named numpy'?

I've created a HDInsight cluster with Apache Spark using Script Actions as described in Install and use Spark 1.0 on HDInsight Hadoop clusters:
You can install Spark on any type of cluster in Hadoop on HDInsight using Script Action cluster customization. Script action lets you run scripts to customize a cluster, only when the cluster is being created. For more information, see Customize HDInsight cluster using script action.
I have run a basic Python script (the word-count sample) that worked, but when I start a Python script that uses NumPy I get the import error 'No module named numpy' raised on the nodes.
Why can't I import the package, since NumPy was (supposedly) installed out of the box on an HDInsight cluster? Is there a way to install NumPy on the nodes? HDInsight doesn't allow you any access to the nodes.
You can use Script Actions to apply custom packages to all the data nodes in an HDInsight cluster. The docs are at http://acom-sandbox.azurewebsites.net/en-us/documentation/articles/hdinsight-hadoop-customize-cluster/
Roughly speaking what you want to do is create your cluster in PowerShell and include something like:
$config = Add-AzureHDInsightScriptAction -Config $config -Name MyScriptActionName -Uri http://uri.to/scriptaction.ps1 -Parameters MyScriptActionParameter -ClusterRoleCollection HeadNode,DataNode
The script at http://uri.to/scriptaction.ps1 can easily be stored on blob storage, and it is run on the node types specified. That's the script you would use to install any custom Python (or other) packages.
You can use a custom script as mentioned in the other answers; however, the command below worked for me on an HBase HDInsight cluster. (It should work on a Hadoop HDInsight cluster as well.)
sudo apt-get install python-numpy
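Whichever way you install it, a quick way to confirm that NumPy is actually importable on the worker nodes (and not just the head node) is a small PySpark check like this sketch; the partition count and app name are arbitrary assumptions.

from pyspark.sql import SparkSession

# Quick check that numpy is importable on the executors, not just the driver.
spark = SparkSession.builder.appName("numpy-check").getOrCreate()
sc = spark.sparkContext

def numpy_version(_):
    import numpy
    return numpy.__version__

# Run the import on several partitions across the cluster and collect the results;
# a 'No module named numpy' error here means some worker nodes still lack the package.
print(set(sc.parallelize(range(8), 8).map(numpy_version).collect()))

spark.stop()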
