I'm trying to set up PySpark on my desktop and interact with it via the terminal.
I'm following this guide:
http://jmedium.com/pyspark-in-python/
When I run 'pyspark' in the terminal it says:
/home/jacob/spark-2.1.0-bin-hadoop2.7/bin/pyspark: line 45: python:
command not found
env: ‘python’: No such file or directory
I've followed several guides, all of which lead to this same issue (some differ in the details of setting up the .profile; so far none have worked correctly).
I have Java, Python 3.6, and Scala installed.
My .profile is configured as follows:
#Spark and PySpark Setup
PATH="$HOME/bin:$HOME/.local/bin:$PATH"
export SPARK_HOME='/home/jacob/spark-2.1.0-bin-hadoop2.7'
export PATH=$SPARK_HOME:$PATH
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
#export PYSPARK_DRIVER_PYTHON="jupyter"
#export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
export PYSPARK_PYTHON=python3.6.5
Note that Jupyter Notebook is commented out because I want to launch pyspark in the shell right now, without the notebook starting.
Interestingly, spark-shell launches just fine.
I'm using Ubuntu 18.04.1 and Spark 2.1.
See the images below.
I've tried every guide I can find, and since this is my first time setting up Spark, I'm not sure how to troubleshoot it from here.
Thank you
(Screenshots: attempting to execute pyspark; .profile; versions)
You should have set export PYSPARK_PYTHON=python3 instead of export PYSPARK_PYTHON=python3.6.5 in your .profile
then source .profile, of course.
That worked for me.
As for other options: installing Python via sudo apt install python (which is Python 2.x) is not appropriate.
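A quick way to sanity-check that value before launching pyspark (a minimal sketch; run it in the same shell environment, since PYSPARK_PYTHON must name a command that exists on your PATH):

import os
import shutil

cmd = os.environ.get("PYSPARK_PYTHON", "python")
# shutil.which() returns None when the command is not on PATH,
# which is exactly the "command not found" case pyspark reports.
print(cmd, "->", shutil.which(cmd))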
For those who may come across this, I figured it out!
I specifically chose to use an older version of Spark, 2.1.0, in order to follow along with a tutorial I was watching. I did not know that the version of Python I had installed (3.6.5 at the time of writing this) is incompatible with Spark 2.1. Thus PySpark would not launch.
I solved this by using Python 2.7 and setting the path accordingly in .bashrc
export PYTHONPATH=$PYTHONPATH:/usr/lib/python2.7
export PYSPARK_PYTHON=python2.7
People using Python 3.8 and Spark <= 2.4.5 will have the same problem.
In this case, the only solution I found was to update Spark to v3.0.0.
See https://bugs.python.org/issue38775
For GNU/Linux users who have the python3 package installed (especially on Ubuntu/Debian distros), there is a package called "python-is-python3" that makes the python command resolve to python3.
# apt install python-is-python3
Python 2.7 is deprecated now (as of Ubuntu 20.10, 2020), so do not try installing it.
I have already solved this issue. Just type this command:
sudo apt install python
Related
After I installed the Google Cloud SDK on my computer, I opened the terminal and typed "gcloud --version", but it says "python was not found".
Note: I unchecked the box saying "Install python bundle" when I installed the Google Cloud SDK because I already have Python 3.10.2 installed.
So, how do I fix this?
Thanks in advance.
As mentioned in the document:
Cloud SDK requires Python; supported versions are Python 3 (preferred, 3.5 to 3.8) and Python 2 (2.7.9 or later). By default, the Windows version of Cloud SDK comes bundled with Python 3 and Python 2. To use Cloud SDK, your operating system must be able to run a supported version of Python.
As suggested by @John Hanley, the CLI cannot find the Python that is already installed. Try reinstalling the CLI and selecting the "Install python bundle" option. If you are still facing the issue, another workaround is to try Python version 2.x.x.
You can follow the steps below:
1. Uninstall all Python versions 3 and above.
2. Install Python version 2.x.x (I installed 2.7.17).
3. Create an environment variable CLOUDSDK_PYTHON and set its value to C:\Python27\python.exe.
4. Run GoogleCloudSDKInstaller.exe again.
On Ubuntu Linux, you can define this variable in the .bashrc file:
export CLOUDSDK_PYTHON=/usr/bin/python3
On Windows, setting the CLOUDSDK_PYTHON environment variable fixes this, but when I first tried this I pointed the variable to the folder containing the python executable, and that didn't work. The variable apparently must point to the executable file itself.
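A small illustrative check, assuming CLOUDSDK_PYTHON is already set, to confirm it points at the python executable itself rather than its containing folder:

import os

p = os.environ.get("CLOUDSDK_PYTHON", "")
print("CLOUDSDK_PYTHON =", p)
# Should print True: the variable must name the executable, not a directory.
print("is a file:", os.path.isfile(p))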
I installed Hadoop, Pig, Hive, HBase, and ZooKeeper successfully.
I installed Apache Phoenix to access HBase. Below are my PATH variables.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH="$PATH:$JAVA_HOME/bin"
export PATH="/home/vijee/anaconda3/bin:$PATH"
export HADOOP_HOME=/home/vijee/hadoop-2.7.7
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH="$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin"
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
export ZOOKEEPER_HOME=/home/vijee/apache-zookeeper-3.6.2-bin
export PATH=$PATH:$ZOOKEEPER_HOME/bin
export HBASE_HOME=/home/vijee/hbase-1.4.13-bin
export PATH=$PATH:$HBASE_HOME/bin
export PHOENIX_HOME=/home/vijee/apache-phoenix-4.15.0-HBase-1.4-bin
export PATH=$PATH:$PHOENIX_HOME/bin
I copied phoenix-4.15.0-HBase-1.4-client.jar, phoenix-4.15.0-HBase-1.4-server.jar, and phoenix-core-4.15.0-HBase-1.4.jar to the HBase lib directory and restarted HBase and ZooKeeper.
When I run the Phoenix command below, it throws this error:
(base) vijee#vijee-Lenovo-IdeaPad-S510p:~/apache-phoenix-4.15.0-HBase-1.4-bin/bin$ psql.py localhost $PHOENIX_HOME/examples/WEB_STAT.sql $PHOENIX_HOME/examples/WEB_STAT.csv $PHOENIX_HOME/examples/WEB_STAT_QUERIES.sql
Traceback (most recent call last):
File "/home/vijee/apache-phoenix-4.15.0-HBase-1.4-bin/bin/psql.py", line 57, in <module>
if hbase_env.has_key('JAVA_HOME'):
AttributeError: 'dict' object has no attribute 'has_key'
My Python Version
$ python --version
Python 3.8.3
I know it is a Python compatibility issue and psql.py is written for Python 2.x.
How to resolve this issue?
Briefly searching, it looks like HBase 1.4 is from 2017, while the latest stable is 2.2.5; the release notes imply that it works with Python 3.
Consider simply using the newer jar (see the Apache archive link for stable files).
At least psql.py in the latest Apache Phoenix code does appear to support Python 3 (https://github.com/apache/phoenix/blob/master/bin/psql.py), so you should be able to get a newer version than you have that will work with it.
This can be seen in the file's commit history on GitHub: https://github.com/apache/phoenix/commits/master/bin/psql.py (the commit adding Python 3 support replaces has_key with in).
If you must use 1.4.x, you may be able to run psql.py with Python 2 instead. Most operating systems will accept having them installed in parallel, though it may make some dependency management confusing and it is not a maintainable solution.
.has_key() was removed in Python 3; use in instead!
See also Should I use 'has_key()' or 'in' on Python dicts?
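For illustration, a minimal sketch of the change (the hbase_env dict here is just a stand-in for the one psql.py builds):

hbase_env = {'JAVA_HOME': '/usr/lib/jvm/java-8-openjdk-amd64'}

# Python 2 only, raises AttributeError on Python 3:
#   if hbase_env.has_key('JAVA_HOME'):
# Works on both Python 2 and Python 3:
if 'JAVA_HOME' in hbase_env:
    print(hbase_env['JAVA_HOME'])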
I am currently on JRE: 1.8.0_181, Python: 3.6.4, spark: 2.3.2
I am trying to execute following code in Python:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Basics').getOrCreate()
This fails with following error:
spark = SparkSession.builder.appName('Basics').getOrCreate()
Traceback (most recent call last):
File "", line 1, in
File "C:\Tools\Anaconda3\lib\site-packages\pyspark\sql\session.py", line 173, in getOrCreate
sc = SparkContext.getOrCreate(sparkConf)
File "C:\Tools\Anaconda3\lib\site-packages\pyspark\context.py", line 349, in getOrCreate
SparkContext(conf=conf or SparkConf())
File "C:\Tools\Anaconda3\lib\site-packages\pyspark\context.py", line 118, in init
conf, jsc, profiler_cls)
File "C:\Tools\Anaconda3\lib\site-packages\pyspark\context.py", line 195, in _do_init
self._encryption_enabled = self._jvm.PythonUtils.getEncryptionEnabled(self._jsc)
File "C:\Tools\Anaconda3\lib\site-packages\py4j\java_gateway.py", line 1487, in getattr
"{0}.{1} does not exist in the JVM".format(self._fqn, name))
py4j.protocol.Py4JError: org.apache.spark.api.python.PythonUtils.getEncryptionEnabled does not exist in the JVM
Does anyone have any idea what the potential issue could be here?
Appreciate any help or feedback here. Thank you!
Using findspark is expected to solve the problem:
Install findspark
$pip install findspark
In your code, use:
import findspark
findspark.init()
Optionally you can specify "/path/to/spark" in the init method above; findspark.init("/path/to/spark")
As outlined in "pyspark error does not exist in the jvm error when initializing SparkContext", adding the PYTHONPATH environment variable with the value
%SPARK_HOME%\python;%SPARK_HOME%\python\lib\py4j-<version>-src.zip;%PYTHONPATH%
(just check which py4j version you have in your spark/python/lib folder) helped resolve this issue.
Solution #1. Check your environment variables
You are getting “py4j.protocol.Py4JError: org.apache.spark.api.python.PythonUtils.getEncryptionEnabled does not exist in the JVM” because the environment variables are not set correctly.
Check whether your environment variables are set correctly in the .bashrc file. For Unix and Mac, the variables should look something like the ones below. You can find the .bashrc file in your home directory.
Note: Do not copy and paste the below line as your Spark version might be different from the one mentioned below.
export SPARK_HOME=/opt/spark-3.0.0-bin-hadoop2.7
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.9-src.zip:$PYTHONPATH
export PATH=$SPARK_HOME/bin:$SPARK_HOME/python:$PATH
If you are running on Windows, open the environment variables window and add/update the following:
SPARK_HOME => /opt/spark-3.0.0-bin-hadoop2.7
PYTHONPATH => %SPARK_HOME%/python;%SPARK_HOME%/python/lib/py4j-0.10.9-src.zip;%PYTHONPATH%
PATH => %SPARK_HOME%/bin;%SPARK_HOME%/python;%PATH%
After setting the environment variables, restart your tool or command prompt.
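After restarting, a quick way to verify from Python that the variables are visible (just a sketch; adjust the variable names to whatever you set):

import os

for var in ("SPARK_HOME", "PYTHONPATH", "PATH"):
    print(var, "=", os.environ.get(var))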
Solution #2. Using findspark
Install the findspark package by running pip install findspark and add the following lines to your pyspark program:
import findspark
findspark.init()
# you can also pass spark home path to init() method like below
# findspark.init("/path/to/spark")
Solution #3. Copying the pyspark and py4j modules to Anaconda lib
Sometimes after changing/upgrading the Spark version, you may get this error due to a version incompatibility between the installed pyspark and the pyspark available in the Anaconda lib. To correct it:
Note: copy the specified folder from inside the zip files and make sure you have environment variables set right as mentioned in the beginning.
Copy the py4j folder from :
C:\apps\opt\spark-3.0.0-bin-hadoop2.7\python\lib\py4j-0.10.9-src.zip\
to
C:\Programdata\anaconda3\Lib\site-packages\.
And, copy pyspark folder from :
C:\apps\opt\spark-3.0.0-bin-hadoop2.7\python\lib\pyspark.zip\
to
C:\Programdata\anaconda3\Lib\site-packages\
Sometimes you may need to restart your system for the environment variables to take effect.
Credits to : https://sparkbyexamples.com/pyspark/pyspark-py4j-protocol-py4jerror-org-apache-spark-api-python-pythonutils-jvm/
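If you would rather script that copy than do it by hand, here is a rough sketch (it assumes SPARK_HOME is set and that you are installing into the active interpreter's site-packages rather than a virtualenv):

import os
import sysconfig
import zipfile

spark_lib = os.path.join(os.environ["SPARK_HOME"], "python", "lib")
target = sysconfig.get_paths()["purelib"]  # the active interpreter's site-packages

for name in os.listdir(spark_lib):
    # pyspark.zip plus the bundled py4j-*-src.zip
    if name == "pyspark.zip" or (name.startswith("py4j-") and name.endswith(".zip")):
        with zipfile.ZipFile(os.path.join(spark_lib, name)) as zf:
            zf.extractall(target)
            print("extracted", name, "->", target)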
You just need to install an older version of pyspark. This version works: pip install pyspark==2.4.7
I had the same problem on Windows, and I found that my Python had different versions of py4j and pyspark than the ones Spark expected.
Solved by copying the python modules inside the zips py4j-0.10.8.1-src.zip and pyspark.zip (found in spark-3.0.0-preview2-bin-hadoop2.7\python\lib) into C:\Anaconda3\Lib\site-packages.
I had the same problem. In my case, with Spark 2.4.6, installing pyspark 2.4.6 or 2.4.x (the same version as Spark) fixed the problem, since pyspark 3.0.1 (pip install pyspark installs the latest version) caused it.
I recently faced this issue.
My mistake was that I was opening a normal Jupyter notebook.
Always open Anaconda Prompt -> type 'pyspark' -> it will automatically open a Jupyter notebook for you. After that, you will not get this error.
If you use PyCharm:
- Download Spark 2.4.4
- In Settings / Project Structure / Add Content Root, add py4j-0.10.8.1-src.zip and pyspark.zip from spark-2.4.4/python/lib
This may happen if you have pip-installed pyspark 3.1 and your local Spark is 2.4 (i.e., a version incompatibility).
In my case, to overcome this, I uninstalled pyspark 3.1 and switched to pip install pyspark 2.4.
My advice here is to check for version incompatibility issues too, along with the other answers here.
If it is not already clear from the previous answers: your pyspark package version has to be the same as the installed Apache Spark version.
For example, I use Ubuntu and PySpark 3.2. In the environment variables (.bashrc):
export SPARK_HOME="/home/ali/spark-3.2.0-bin-hadoop3.2"
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
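A quick way to compare the two versions (a sketch, assuming SPARK_HOME is set; spark-submit prints its version banner to stderr):

import os
import subprocess

import pyspark

print("pip pyspark:", pyspark.__version__)
result = subprocess.run(
    [os.path.join(os.environ["SPARK_HOME"], "bin", "spark-submit"), "--version"],
    capture_output=True, text=True,
)
# The banner includes a line like "version 3.2.0"; it should match the pip version.
print("spark build:", result.stderr or result.stdout)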
If you updated pyspark or Spark
If, like me, the problem occurred after you updated one of the two and you didn't know that the Pyspark and Spark versions need to match, note that the Pyspark PyPI repo says:
NOTE: If you are using this with a Spark standalone cluster you must ensure that the version (including minor version) matches or you may experience odd errors.
Therefore, upgrading/downgrading Pyspark/Spark so that their versions match solves the issue.
To upgrade Spark follow: https://sparkbyexamples.com/pyspark/pyspark-py4j-protocol-py4jerror-org-apache-spark-api-python-pythonutils-jvm/
If using Spark with the AWS Glue libs locally (https://github.com/awslabs/aws-glue-libs), ensure that Spark, PySpark and the version of AWS Glue libs all align correctly. As of now, the current valid combinations are:
aws-glue-libs branch | Glue Version | Spark Version
glue-0.9             | 0.9          | 2.2.1
glue-1.0             | 1.0          | 2.4.3
glue-2.0             | 2.0          | 2.4.3
master               | 3.0          | 3.1.1
Regarding the previously mentioned solution with findspark, remember that it must be at the top of your script:
import sys
import findspark
findspark.init()
from...
import...
I have recently gotten hold of a Rackspace Ubuntu server and it has Pythons all over the place:
IPython in 3.5, Pandas in 3.4 & 2.7; modules I need like pyodbc etc. are only in 2.7.
Therefore, I am keen to clean up the box and, as a 2.7 user, keep everything in 2.7.
So the key question is, is there a way to remove both 3.4 and 3.5 efficiently at the same time while keeping Python 2.7?
Removing Python 3 was the worst thing I have done since I recently moved to the world of Linux. It removed Firefox and my launcher, and, as I read while trying to fix my problem, it may also remove your desktop and terminal! I finally fixed it after a long daytime nightmare. Just don't remove Python 3. Keep it there!
If that happens to you, here is the fix:
https://askubuntu.com/q/384033/402539
https://askubuntu.com/q/810854/402539
EDIT: As pointed out in recent comments, this solution may BREAK your system.
You most likely don't want to remove python3.
Please refer to the other answers for possible solutions.
Outdated answer (not recommended)
sudo apt-get remove 'python3.*'
So I worked out in the end that you cannot uninstall 3.4, as it is the default on Ubuntu.
All I did was simply remove Jupyter, then alias python=python2.7 and install all packages on Python 2.7 again.
Arguably, I could install virtualenv, but my colleagues and I are only using 2.7. I am just going to be lazy in this case :)
First of all, don't try the following command as suggested by Germain above.
`sudo apt-get remove 'python3.*'`
In Ubuntu, a lot of software depends on Python 3, so if you execute this command it will remove all of it, as happened to me. I found the following answer useful for recovering from it:
https://askubuntu.com/questions/810854/i-deleted-package-python3-on-ubuntu-and-i-have-lost-dashboard-terminal-and-un
If you want to use different Python versions for different projects, then create virtual environments; they are very useful. Refer to the following link to create virtual environments.
Creating a virtual environment also helps when using TensorFlow and Keras in Jupyter Notebook.
https://linoxide.com/linux-how-to/setup-python-virtual-environment-ubuntu/
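As an alternative to the virtualenv tool covered in the linked guide, Python 3 ships a built-in venv module that does the same job; a minimal sketch:

import venv

# Creates ./myproject-env with its own python and pip.
venv.create("myproject-env", with_pip=True)
# Then activate it from the shell: source myproject-env/bin/activate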
Don't try any of the above ways, nor sudo apt autoremove python3, because it will remove all GNOME-based applications from your system, including gnome-terminal. If you have already made that mistake and are left with only the console, then try sudo apt install gnome from the console.
Try to change your default Python version instead of removing it. You can do this through the .bashrc file or an export PATH command.
It's simple, just try:
sudo apt-get remove python3.7 (or whichever version you want to remove)
Problem: I am running Python 2.7.3 with IDLE. I want to add packages like numpy, but I cannot control where they are installed. While installing packages using the terminal, I do not seem to be able to direct them to, say,
/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/
At the same time, I cannot tell 2.7.3 to look where they are, no matter what I include in the bash-profile, profile, or profile_pysave. I've read more than 50 suggestions all over the web (including Stack Overflow), but none of them seem to work.
One constraint on the solution is that I do not wish to create an isolated environment as suggested by PATH issues with homebrew-installed Python 2 and Python 3 on OSX.
This is how my bash-profile looks (yes, I tried using Anaconda, and gave up):
export PATH="/usr/local/bin:$PATH"
export PATH="/usr/local/bin:/usr/local/share/python:$PATH"
PYTHONPATH="${PYTHONPATH}:/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-
packages/"
export PYTHONPATH
# added by Anaconda 1.5.1 installer
export PATH="/Users/YalcinU/anaconda/bin:$PATH"
# added by Anaconda 1.5.1 installer
export PATH="/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages
/anaconda/bin:$PATH"
How can you set up PYTHONPATH on OS X?
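One way to see whether a PYTHONPATH entry is actually reaching the interpreter (a small sketch; the path is the site-packages directory from the question):

import sys

target = "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages"
print(any(p.rstrip("/") == target for p in sys.path))  # True if the entry was picked up
for p in sys.path:
    print(p)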