I have been using PySpark 2.4 for some time. Spark is installed on my system at /usr/local/spark. Suddenly I see the following error when I type pyspark on the command line.
Failed to find Spark jars directory (/assembly/target/scala-2.12/jars).
You need to build Spark with the target "package" before running this program.
System OS = CentOS 7
I have Python 2.7 and Python 3.6 installed. However, when I installed Spark, I set Python 3.6 as the default.
There are a couple of cron jobs that run every day (and have been running for quite some time) in which pyspark is used. I am afraid an error like the one above may prevent those crons from running.
Please shed some light on this.
There is documentation indicating that Python 3.9 is the last version compatible with Spark 3.0.0-preview. For someone new to setting up Hadoop, Spark, Scala, Python, and PyCharm together on Windows, the number of possible version combinations is daunting; after browsing the first few suggestions, and given what the course recommends, I am mainly trying to work out which Python and Spark versions are compatible. The interpreter used by Anaconda3 is 3.8.8.
From the image below, it would appear (from the school of hard knocks) that the compatibility may simply not exist. The downloads page is not indicative of which Python versions are supported. The course uses Scala 2.11, while the latest release (same link) indicates Scala 2.12 is needed. Somewhere within all these choices (on Windows 10) there is a solution, but it is elusive. The Hadoop version seems to be an issue as well (Hadoop 2.7).
Other compatibility issues occur before I even get to Edit Configurations for each project, and include missing files or access errors (this is a company machine with privileged management, but not full admin rights).
The thread dump below contains a lot of information, but being new to this, it is difficult to sort through all the debug output. Possibly the JDK is wrong? But it is the one suggested for use with JRE 1.8.0_201. I have also seen issues about the space in "Program Files" possibly being problematic in the JAVA_HOME path, yet Java did not seem happy when, during one of the trial setups, it was installed to a different directory.
Anaconda3>pycharm
CompileCommand: exclude com/intellij/openapi/vfs/impl/FilePartNodeRoot.trieDescend bool exclude = true
2022-11-04 08:40:31,040 [ 1132] WARN - #c.i.o.f.i.FileTypeManagerImpl -
com.adacore.adaintellij.file.AdaSpecFileType#4f671e00 from 'PluginDescriptor(name=Ada, id=com.adacore.Ada-IntelliJ, descriptorPath=plugin.xml, path=~\AppData\Roaming\JetBrains\PyCharmCE2022.2\plugins\Ada-IntelliJ, version=0.6-dev, package=null, isBundled=false)' (class com.adacore.adaintellij.file.AdaSpecFileType) and
com.adacore.adaintellij.file.AdaBodyFileType#22a64016 from 'PluginDescriptor(name=Ada, id=com.adacore.Ada-IntelliJ, descriptorPath=plugin.xml, path=~\AppData\Roaming\JetBrains\PyCharmCE2022.2\plugins\Ada-IntelliJ, version=0.6-dev, package=null, isBundled=false)' (class com.adacore.adaintellij.file.AdaBodyFileType)
both have the same .getDisplayName(): 'Ada'. Please override either one's getDisplayName() to something unique.
com.intellij.diagnostic.PluginException:
com.adacore.adaintellij.file.AdaSpecFileType#4f671e00 from 'PluginDescriptor(name=Ada, id=com.adacore.Ada-IntelliJ, descriptorPath=plugin.xml, path=~\AppData\Roaming\JetBrains\PyCharmCE2022.2\plugins\Ada-IntelliJ, version=0.6-dev, package=null, isBundled=false)' (class com.adacore.adaintellij.file.AdaSpecFileType) and
com.adacore.adaintellij.file.AdaBodyFileType#22a64016 from 'PluginDescriptor(name=Ada, id=com.adacore.Ada-IntelliJ, descriptorPath=plugin.xml, path=~\AppData\Roaming\JetBrains\PyCharmCE2022.2\plugins\Ada-IntelliJ, version=0.6-dev, package=null, isBundled=false)' (class com.adacore.adaintellij.file.AdaBodyFileType)
both have the same .getDisplayName(): 'Ada'. Please override either one's getDisplayName() to something unique.
2022-11-04 08:40:45,822 [ 15914] SEVERE - #c.i.u.m.i.MessageBusImpl - PyCharm 2022.2.3 Build #PC-222.4345.23
2022-11-04 08:40:45,825 [ 15917] SEVERE - #c.i.u.m.i.MessageBusImpl - JDK: 17.0.4.1; VM: OpenJDK 64-Bit Server VM; Vendor: JetBrains s.r.o.
2022-11-04 08:40:45,826 [ 15918] SEVERE - #c.i.u.m.i.MessageBusImpl - OS: Windows 10
[1]: https://spark.apache.org/downloads.html
[2]: https://i.stack.imgur.com/C6BGc.png
First, don't use preview releases. It's been over 2 years since Spark 3 was released; use at least the latest minor release.
That being said, Spark 3 is mostly meant to be used with Hadoop 3. All of it should work fine on Windows using Java 11 (your logs say 17).
You can use Scala 2.12 or 2.13 for Spark.
PySpark support should be fine on Python 3.9. Don't use Anaconda if you don't need it. Download Python directly, then pip install pyspark. That's it. You don't even need Hadoop to run Spark code (a minimal sanity check is sketched at the end of this answer).
Also, it is unclear why you're trying to run PyCharm from the terminal. Start with spark-shell; if that works, run pyspark, then spark-submit. Only once those work should you realistically move towards an IDE.
Alternatively, don't pollute your host machine with a bunch of software, install Docker instead, and use that to run Jupyter with Spark pre-configured - https://jupyter-docker-stacks.readthedocs.io/en/latest/using/running.html
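As a quick sanity check after pip install pyspark, a minimal local session (a sketch; the app name and toy data here are arbitrary) looks like this:
from pyspark.sql import SparkSession

# Purely local session; no Hadoop installation is required for this.
spark = SparkSession.builder.master("local[*]").appName("smoke-test").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.show()

spark.stop()
If that prints a two-row table, the pip-installed PySpark works, and any remaining problems are IDE or environment configuration.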
I'm trying to setup pyspark on my desktop and interact with it via the terminal.
I'm following this guide,
http://jmedium.com/pyspark-in-python/
When I run 'pyspark' in the terminal, it says:
/home/jacob/spark-2.1.0-bin-hadoop2.7/bin/pyspark: line 45: python:
command not found
env: ‘python’: No such file or directory
I've followed several guides, which all lead to this same issue (some have different details on setting up the .profile; so far none have worked).
I have Java, Python 3.6, and Scala installed.
My .profile is configured as follows:
#Spark and PySpark Setup
PATH="$HOME/bin:$HOME/.local/bin:$PATH"
export SPARK_HOME='/home/jacob/spark-2.1.0-bin-hadoop2.7'
export PATH=$SPARK_HOME:$PATH
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
#export PYSPARK_DRIVER_PYTHON="jupyter"
#export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
export PYSPARK_PYTHON=python3.6.5
Note that the Jupyter notebook lines are commented out because I want to launch pyspark in the shell right now, without the notebook starting.
Interestingly, spark-shell launches just fine.
I'm using Ubuntu 18.04.1 and
Spark 2.1
See Images
I've tried every guide I can find, and since this is my first time setting up Spark, I'm not sure how to troubleshoot it from here.
Thank you
Attempting to execute pyspark
.profile
versions
You should have set export PYSPARK_PYTHON=python3 instead of export PYSPARK_PYTHON=python3.6.5 in your .profile,
then source .profile, of course.
That worked for me.
As for other options, installing python via sudo apt (which installs 2.x) is not appropriate.
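To double-check which interpreter the driver actually picked up after sourcing .profile, here is a quick sketch you can paste into the pyspark shell once it starts (nothing here is specific to this setup):
import sys

# Should print the path of your python3 binary, not a Python 2 interpreter.
print(sys.executable)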
For those who may come across this, I figured it out!
I specifically chose to use an older version of Spark, 2.1.0, in order to follow along with a tutorial I was watching. I did not know that the version of Python I had installed (3.6.5 at the time of writing this) is incompatible with Spark 2.1. Thus PySpark would not launch.
I solved this by using Python 2.7 and setting the path accordingly in .bashrc
export PYTHONPATH=$PYTHONPATH:/usr/lib/python2.7
export PYSPARK_PYTHON=python2.7
People using Python 3.8 and Spark <= 2.4.5 will have the same problem.
In this case, the only solution I found was to update Spark to version 3.0.0.
See https://bugs.python.org/issue38775 for details.
For GNU/Linux users who have the python3 package installed (especially on Ubuntu/Debian distros), there is a package called "python-is-python3" that makes the python command resolve to python3:
# apt install python-is-python3
Python 2.7 is deprecated now (as of 2020 / Ubuntu 20.10), so do not try installing it.
I have already solved this issue. Just type this command:
sudo apt install python
I have just run an upgrade on the h2o package in python, but am only getting a version of 3.10.4.1. However, my recently upgraded h2o application is running 3.10.4.6 - can you please help me rectify this discrepancy? Thanks in advance.
An H2O version mismatch is caused when the H2O Java application and the h2o Python module (or R package) have different version numbers. If you only use the h2o Python module, this will not happen. However, if you launch an H2O cluster from the command line, java -jar h2o.jar, and then connect to it via the h2o Python module, the version numbers can be in disagreement.
If this happens, the best way to resolve it is to kill the existing Java process and start the H2O cluster from inside the h2o Python module. Alternatively, you can pip uninstall h2o, visit the H2O Downloads page, and install the matching version of the Python package.
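A minimal sketch of the first route, starting the cluster from the Python module so the backend and the package stay in sync:
import h2o

# Version of the installed Python package.
print(h2o.__version__)

# Starts (or attaches to) a cluster using the jar bundled with this package,
# and prints the cluster version so the two numbers can be compared.
h2o.init()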
This has been resolved. It was a path mix-up, which I have fixed, and I can now see the same h2o version from the cmd and from Jupyter.
Here is another solution: the best way is to match both versions, ideally on the latest release.
A few things are tricky while matching the versions. When you upgrade the one with the lower version, make sure you call h2o.shutdown() to stop H2O first; otherwise, the uninstall or install won't complete successfully.
Then go to the H2O.ai downloads page and follow the steps there.
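A sketch of that shutdown step before reinstalling (it assumes an H2O cluster is already running locally):
import h2o

# Attach to the running cluster first, then ask the backend to stop
# (h2o.shutdown asks for confirmation), so pip can cleanly upgrade afterwards.
h2o.init()
h2o.shutdown()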
I tried to follow the instructions in this book:
Large Scale Machine Learning with Python
It uses a VM image to run Spark involving Oracle VM VirtualBox and Vagrant. I almost managed to get the VM working but am blocked by not having the permission to switch virtualization on in the BIOS (I do not have the password and doubt my employer's IT department will allow me to switch this on). See also discussion here.
Anyway, what other options do I have to play with pyspark locally (I have installed it locally)? My first aim is to get this Scala code:
scala> val file = sc.textFile("C:\\war_and_peace.txt")
scala> val warsCount = file.filter(line => line.contains("war"))
scala> val peaceCount = file.filter(line => line.contains("peace"))
scala> warsCount.count()
res0: Long = 1218
scala> peaceCount.count()
res1: Long = 128
running in Python. Any pointers would be very much appreciated.
You can set up Spark with the Python and Scala shells on Windows, but the caveat is that, in my experience, performance on Windows is inferior to that on macOS and Linux. If you want to go the route of setting everything up on Windows, I made a short write-up of the instructions for this not too long ago that you can check out here. I am pasting the text below just in case I ever move the file from that repo or the link breaks for some other reason.
Download and Extract Spark
Download latest release of spark from apache.
Be aware that it is critical that you get the right Hadoop binaries for the version of spark you choose. See section on Hadoop binaries below.
Extract with 7-zip.
Install Java and Python
Install latest version of 64-bit Java.
Install Anaconda3 Python 3.5, 64-bit (or other version of your choice) for all users. Restart server.
Test Java and Python
Open command line and type java -version. If it is installed properly you will see an output like this:
java version "1.8.0_121"
Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
Type either python or python --version.
The first will open the python shell after showing the version information. The second will show only version information similar to this:
Python 3.5.2 :: Anaconda 4.2.0 (64-bit)
Download Hadoop binary for Windows 64-bit
You likely don't have Hadoop installed on Windows, but deep within its core Spark will look for this file and possibly other binaries. Thankfully, a Hadoop contributor has compiled these and maintains a repository with binaries for Hadoop 2.6. These binaries will work for Spark version 2.0.2, but will not work with 2.1.0; to use Spark 2.1.0, download the binaries from here.
The best tactic is to clone the repo, keep the Hadoop folder corresponding to your version of Spark, and add that hadoop-%version% folder to your environment as HADOOP_HOME.
Add Java and Spark to Environment
Add the paths to Java and Spark as the environment variables JAVA_HOME and SPARK_HOME, respectively.
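Before moving on, a quick way to confirm those variables are visible to a new process (a small Python sketch; run it from a freshly opened command prompt):
import os

# Each of these should print the folder you just configured, not "<not set>".
for var in ("JAVA_HOME", "SPARK_HOME", "HADOOP_HOME"):
    print(var, "=", os.environ.get(var, "<not set>"))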
Test pyspark
On the command line, type pyspark and observe the output. At this point Spark should start in the Python shell.
Setup pyspark to use Jupyter notebook
Instructions for using interactive Python shells with pyspark exist within the pyspark code and can be accessed through your editor. To use the Jupyter notebook, type the following two commands before launching pyspark:
set PYSPARK_DRIVER_PYTHON=jupyter
set PYSPARK_DRIVER_PYTHON_OPTS='notebook'
Once those variables are set, pyspark will launch in the Jupyter notebook with the default SparkContext initialized as sc and the SparkSession initialized as spark. ProTip: open http://127.0.0.1:4040 to view the Spark UI, which includes lots of useful information about your pipeline and completed processes. Any additional notebooks opened with Spark running will use consecutive ports, i.e. 4041, 4042, etc.
The gist is that getting the right Hadoop binaries for your version of Spark is critical. The rest is making sure your path and environment variables are properly configured.
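Once pyspark launches, the Scala snippet from the original question translates almost line for line; a sketch (assuming the same local file path) looks like this:
from pyspark.sql import SparkSession

# Local session; inside an existing pyspark shell you already have sc and can skip these two lines.
spark = SparkSession.builder.master("local[*]").appName("war-and-peace").getOrCreate()
sc = spark.sparkContext

file = sc.textFile("C:\\war_and_peace.txt")
wars_count = file.filter(lambda line: "war" in line)
peace_count = file.filter(lambda line: "peace" in line)

print(wars_count.count())
print(peace_count.count())

spark.stop()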
I am using Python 3.5 and I am taking the Algorithms specialization courses on Coursera. The professor teaching this course posted a program which can help us measure the time and memory associated with running a program. It has an import resource command at the top. I tried to run this program along with the programs I have written in Python, and every time I received ImportError: No module named 'resource'.
I used the same code on Ubuntu and got no errors at all.
I followed suggestions from Stack Overflow answers and tried adding PYTHONPATH and PYTHONHOME and editing the PATH environment variable.
I have no idea what else I can do here.
Is there any file that I can download and install in the Lib or site-packages folder of my Python installation?
resource is a Unix-specific package, as described at https://docs.python.org/2/library/resource.html, which is why it worked for you on Ubuntu but raised an error when you tried to use it on Windows.
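If the course script only uses resource for timing/memory reporting, one common workaround (a sketch, not part of the original answer) is to guard the import so the script still runs on Windows:
try:
    import resource   # Unix-only module
except ImportError:
    resource = None   # on Windows: skip the memory measurement or use another tool

# Later, only call resource.getrusage(...) when resource is not None.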
I ran into a similar error on Windows 10. Here is what solved it for me.
Downgrade to the Apache Spark 2.3.2 prebuilt version.
Install (or downgrade) the JDK to version 1.8.0.
My installed JDK was 1.9.0, which doesn't seem to be compatible with Spark 2.3.2 or 2.4.0.
Make sure that when you run java -version in cmd (Command Prompt), it shows Java version 8. If you are seeing version 9, you will need to change your system PATH environment variable to ensure it points to Java version 8.
Check this link to get help on changing the PATH if you have multiple Java versions installed.
Hope this helps someone, I was stuck on this issue for almost a week before finally finding a solution.