I have an EMR cluster using EMR-6.3.1.
I am using the Python3 Kernel.
I have a very simple bootstrap script in S3:
#!/bin/bash
sudo python3 -m pip install Cython==0.29.4 boto==2.49.0 boto3==1.18.50 numpy==1.19.5 pandas==1.3.2 pyarrow==5.0.0
These are the bootstrap logs
+ sudo python3 -m pip install Cython==0.29.4 boto==2.49.0 boto3==1.18.50 numpy==1.19.5 pandas==1.3.2 pyarrow==5.0.0
WARNING: Running pip install with root privileges is generally not a good idea. Try `python3 -m pip install --user` instead.
WARNING: The scripts cygdb, cython and cythonize are installed in '/usr/local/bin' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
WARNING: The scripts f2py, f2py3 and f2py3.7 are installed in '/usr/local/bin' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
WARNING: The script plasma_store is installed in '/usr/local/bin' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
and
Collecting Cython==0.29.4
Downloading Cython-0.29.4-cp37-cp37m-manylinux1_x86_64.whl (2.1 MB)
Requirement already satisfied: boto==2.49.0 in /usr/local/lib/python3.7/site-packages (2.49.0)
Collecting boto3==1.18.50
Downloading boto3-1.18.50-py3-none-any.whl (131 kB)
Collecting numpy==1.19.5
Downloading numpy-1.19.5-cp37-cp37m-manylinux2010_x86_64.whl (14.8 MB)
Collecting pandas==1.3.2
Downloading pandas-1.3.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.3 MB)
Collecting pyarrow==5.0.0
Downloading pyarrow-5.0.0-cp37-cp37m-manylinux2014_x86_64.whl (23.6 MB)
Collecting s3transfer<0.6.0,>=0.5.0
Downloading s3transfer-0.5.2-py3-none-any.whl (79 kB)
Collecting botocore<1.22.0,>=1.21.50
Downloading botocore-1.21.65-py3-none-any.whl (8.0 MB)
Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /usr/local/lib/python3.7/site-packages (from boto3==1.18.50) (0.10.0)
Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.7/site-packages (from pandas==1.3.2) (2021.1)
Collecting python-dateutil>=2.7.3
Downloading python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)
Collecting urllib3<1.27,>=1.25.4
Downloading urllib3-1.26.13-py2.py3-none-any.whl (140 kB)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/site-packages (from python-dateutil>=2.7.3->pandas==1.3.2) (1.13.0)
Installing collected packages: Cython, python-dateutil, urllib3, botocore, s3transfer, boto3, numpy, pandas, pyarrow
Successfully installed Cython-0.29.4 boto3-1.18.50 botocore-1.21.65 numpy-1.19.5 pandas-1.3.2 pyarrow-5.0.0 python-dateutil-2.8.2 s3transfer-0.5.2 urllib3-1.26.13
From a notebook, importing pandas shows the wrong version - 1.2.3 instead of the 1.3.2 I installed.
Further, I see pyarrow fails to import.
I've printed the import path of pandas, which python version is run, and sys.path.
import os
import pandas
import sys
print(sys.path)
print(pandas.__version__)
print(os.path.abspath(pandas.__file__))
print(os.popen('echo $PYTHONPATH').read())
print(os.popen('which python3').read())
# sys.path.append('/usr/local/lib64/python3.7/site-packages') # if I add this, pyarrow can import
import pyarrow
['/', '/emr/notebook-env/lib/python37.zip', '/emr/notebook-env/lib/python3.7', '/emr/notebook-env/lib/python3.7/lib-dynload', '', '/emr/notebook-env/lib/python3.7/site-packages', '/emr/notebook-env/lib/python3.7/site-packages/awseditorssparkmonitoringwidget-1.0-py3.7.egg', '/emr/notebook-env/lib/python3.7/site-packages/IPython/extensions', '/home/emr-notebook/.ipython']
1.2.3
/emr/notebook-env/lib/python3.7/site-packages/pandas/__init__.py
/usr/bin/python3
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-2-aea9862499ce> in <module>
9
10 # sys.path.append('/usr/local/lib64/python3.7/site-packages') # if I add this, pyarrow can import
---> 11 import pyarrow
ModuleNotFoundError: No module named 'pyarrow'
I found I can import pyarrow if I add /usr/local/lib64/python3.7/site-packages to sys.path. This seems like an improvement, but the wrong version of pandas is still imported.
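For anyone hitting the same mismatch, here is a sketch of how to compare the two interpreters involved (the /emr/notebook-env path is the one from my sys.path output above; treat it as an assumption about your cluster's layout):

```shell
# The bootstrap's "sudo python3" installs into the system interpreter,
# while the notebook kernel runs out of /emr/notebook-env. Printing
# sys.executable and sys.path for each side makes the mismatch visible.
python3 -c 'import sys; print(sys.executable); print("\n".join(sys.path))'
# Ask the notebook environment's interpreter, if present, the same question:
if [ -x /emr/notebook-env/bin/python3 ]; then
    /emr/notebook-env/bin/python3 -c 'import sys; print(sys.executable)'
fi
```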
I've tried:
SSH'ing into the master node and mucking with the configuration.
sudo python3 -m pip install --user ...
export PYTHONPATH=/usr/local/lib64/python3.7/site-packages && sudo python3 -m pip install ...
sudo pip3 install --upgrade setuptools && sudo python3 -m pip install ...
Using a pyspark kernel and running sc.install_pypi_package("pandas==1.3.2")
Any help is appreciated. Thank you.
The bootstrap action on EMR is not the last thing that runs before the cluster reaches WAITING and EMR Steps start running.
On my EMR cluster I found that at least the following packages were logged as installed after the bootstrap action ran, which is why I was having issues with numpy not staying upgraded.
Python packages installed post bootstrap
2022-12-10 00:10:28,250 INFO
main: Took 1 minute, 3 seconds and 451 milliseconds to install
packages:
[emr-scripts, emr-s3-select, aws-sagemaker-spark-sdk,
python27-numpy, python27-sagemaker_pyspark, python37-numpy,
python37-sagemaker_pyspark, emr-ddb, hadoop-yarn-nodemanager, docker,
hadoop-yarn, spark-yarn-shuffle, bigtop-utils, cloudwatch-sink,
hadoop, hadoop-lzo, emr-goodies, emrfs, hadoop-mapreduce, hadoop-hdfs,
R-core, aws-hm-client, emr-kinesis, hadoop-hdfs-datanode,
spark-datanucleus]
A workaround is to do the installation in your first cluster step:
...
Steps: [
{
'Name': 'Install Pandas',
"ActionOnFailure": "CONTINUE",
"HadoopJarStep": {
"Jar":
"s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar",
"Args": [
"s3://bucket/prefix/install_packages.sh"
]
}
},
]
Or if you want to use command-runner
{
"Name": "Install Pandas",
"ActionOnFailure": "CONTINUE",
"HadoopJarStep": {
"Jar": "command-runner.jar",
"Args": [
"bash",
"-c",
" aws s3 cp s3://bucket/prefix/install.sh .; chmod +x install.sh; ./install.sh; rm install_pandas.sh "
]
}
}
In my example my install.sh file looks like
#!/bin/bash
set -x
sudo pip3 freeze # for debugging to view the previous package versions
sudo pip3 uninstall numpy -y -v
sudo yum install python3-devel -y
sudo pip3 install boto3==1.26.26 -v
sudo pip3 install numpy==1.21.6 -v
sudo pip3 install pandas==1.3.5 -v
sudo pip3 freeze # for debugging to view the post-install package versions
I am currently trying to install from a requirements file, and pip tells me a package is not found; when I comment that package out, the same error appears for others.
I just deployed an Ubuntu 18.04 server and made the virtual env with python3 -m venv --system-site-packages env, but every time I run pip install -r requirements.txt it fails with:
Collecting apparmor==2.12 (from -r requirements.txt (line 1))
Could not find a version that satisfies the requirement apparmor==2.12 (from -r requirements.txt (line 1)) (from versions: )
No matching distribution found for apparmor==2.12 (from -r requirements.txt (line 1))
if I try and install say pip install apparmor it tells me
Collecting apparmor
Could not find a version that satisfies the requirement apparmor (from versions: )
No matching distribution found for apparmor
But then if I comment out apparmor it tells me this
Collecting apturl==0.5.2 (from -r requirements.txt (line 2))
Could not find a version that satisfies the requirement apturl==0.5.2 (from -r requirements.txt (line 2)) (from versions: )
No matching distribution found for apturl==0.5.2 (from -r requirements.txt (line 2))
and it goes on for other packages seemingly at random. The requirements file was made on my local machine, which is also Ubuntu 18, so I'm unsure why this works locally but not on a fresh deploy.
I have also made sure pip is the newest version.
apparmor and apturl are Ubuntu system packages, not PyPI packages; you can safely ignore them if your code doesn't use them, and just remove them from requirements.txt. If your code does depend on them, make sure they are installed via apt:
apt install -y apparmor apturl && pip install -r requirements.txt
This is a common problem when you don't use a virtual environment to work with Python: your requirements.txt ends up listing every Python package on your system or OS, when it should contain only the packages of your project. At some point you updated requirements.txt with pip freeze > requirements.txt outside a virtual environment, capturing all the Python packages of your OS along with your project's, and perhaps uploaded that to a repository. So when you try to install everything on another computer, you get this kind of error.
Keep in mind that Python comes preinstalled on Ubuntu, and on other systems too.
The first rule is to always work inside a virtual environment (see the venv documentation).
I know it's tedious, but you can back up that requirements.txt and clean it: run your program with no packages installed (a clean install), and each time an import error occurs for a missing package, install it and update the file with pip freeze > requirements.txt.
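The cleanup loop above can be sketched like this (the package names are left as a comment, since they depend on your project):

```shell
# Create an isolated environment so pip freeze sees only what this
# project installs, not every package on the OS.
python3 -m venv venv
. venv/bin/activate
# pip install <a package your code imports>   # re-add deps one at a time
pip freeze > requirements.txt   # now lists only the venv's own packages
deactivate
```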
I have a Python project that runs in a docker container and I am trying to convert to a multistage docker build process. My project depends on the cryptography package. My Dockerfile consists of:
# Base
FROM python:3.6 AS base
RUN pip install cryptography
# Production
FROM python:3.6-alpine
COPY --from=base /root/.cache /root/.cache
RUN pip install cryptography \
&& rm -rf /root/.cache
CMD python
Which I try to build with e.g:
docker build -t my-python-app .
This process works for a number of other Python requirements I have tested, such as pycrypto and psutil, but throws the following error for cryptography:
Step 5/6 : RUN pip install cryptography && rm -rf /root/.cache
---> Running in ebc15bd61d43
Collecting cryptography
Downloading cryptography-2.1.4.tar.gz (441kB)
Collecting idna>=2.1 (from cryptography)
Using cached idna-2.6-py2.py3-none-any.whl
Collecting asn1crypto>=0.21.0 (from cryptography)
Using cached asn1crypto-0.24.0-py2.py3-none-any.whl
Collecting six>=1.4.1 (from cryptography)
Using cached six-1.11.0-py2.py3-none-any.whl
Collecting cffi>=1.7 (from cryptography)
Downloading cffi-1.11.5.tar.gz (438kB)
Complete output from command python setup.py egg_info:
No working compiler found, or bogus compiler options passed to
the compiler from Python's standard "distutils" module. See
the error messages above. Likely, the problem is not related
to CFFI but generic to the setup.py of any Python package that
tries to compile C code. (Hints: on OS/X 10.8, for errors about
-mno-fused-madd see http://stackoverflow.com/questions/22313407/
Otherwise, see https://wiki.python.org/moin/CompLangPython or
the IRC channel #python on irc.freenode.net.)
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-uyh9_v63/cffi/
Obviously I was hoping not to have to install any compiler on my production image. Do I need to copy across another directory other than /root/.cache?
There is no manylinux wheel for Alpine, so you need to compile it yourself. The quote below is pasted from the installation documentation. Install and remove the build dependencies in the same command so that only the package, and not the build toolchain, ends up in the Docker image layer.
If you are on Alpine or just want to compile it yourself then
cryptography requires a compiler, headers for Python (if you’re not
using pypy), and headers for the OpenSSL and libffi libraries
available on your system.
Alpine Replace python3-dev with python-dev if you’re using Python 2.
$ sudo apk add gcc musl-dev python3-dev libffi-dev openssl-dev
If you get an error with openssl-dev you may have to use libressl-dev.
Docs can be found here
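Applied to the production stage of the question's Dockerfile, this might look like the following (a sketch; the apk package list is the one quoted from the docs above, and doing the add, install, and del in one RUN keeps the toolchain out of the final layer):

```dockerfile
# Production stage: compile cryptography against musl, then drop the toolchain.
FROM python:3.6-alpine
RUN apk add --no-cache --virtual .build-deps gcc musl-dev python3-dev libffi-dev openssl-dev \
 && pip install cryptography \
 && apk del .build-deps
CMD python
```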
I hope my answer will be useful.
You should use the --user option when installing cryptography via pip in the base stage, e.g. RUN pip install --user cryptography. This option means that all files are installed into the .local directory of the current user's home directory.
Then COPY --from=base /root/.local /root/.local, because that is where cryptography was installed.
That's all. Full multistage Docker example:
# Base
FROM python:3.6 AS base
RUN pip install --user cryptography
# Production
FROM python:3.6-alpine
COPY --from=base /root/.local /root/.local
CMD python
I'm trying to create "offline package" for python code.
I'm running
pip install -d <dest dir> -r requirements.txt
The thing is that cffi==1.6.0 (inside requirements.txt) doesn't get built into a wheel.
Is there a way I can make it build one? (I'm trying to avoid a dependency on gcc on the target machine.)
install -d just downloads the packages; it doesn't build them. To force everything to be built into wheels, use the pip wheel command instead:
pip wheel -w <dir> -r requirements.txt
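For the offline half of the workflow, pip's standard --no-index and --find-links options let the target machine install purely from that directory. A sketch (shown here with an empty requirements.txt so the commands run anywhere; substitute your real one):

```shell
# Build machine (has gcc): compile everything in requirements.txt to wheels.
: > requirements.txt                 # stand-in; use your real requirements.txt
mkdir -p wheelhouse
python3 -m pip wheel -w ./wheelhouse -r requirements.txt
# Target machine (no compiler needed): install from the wheel directory
# only, never contacting PyPI.
python3 -m pip install --no-index --find-links=./wheelhouse -r requirements.txt
```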
I want to use wheels on my Linux server as it seems much faster, but when I do:
pip install wheel
pip wheel -r requirements_dev.txt
Which contains the following packages
nose
django_coverage
coverage
I get coverage: command not found; it's as if it was never installed.
Is there a fallback if a wheel is not found to pip install or have I not understood/setup this correctly?
Can you try this?
virtualenv venv
source venv/bin/activate
pip install -r requirements_dev.txt
I also get this when using wheel:
pip wheel -r check.txt
Collecting nose (from -r check.txt (line 1))
Using cached nose-1.3.7-py2-none-any.whl
Saved ./nose-1.3.7-py2-none-any.whl
Collecting django_coverage (from -r check.txt (line 2))
Saved ./django_coverage-1.2.4-cp27-none-any.whl
Collecting coverage (from -r check.txt (line 3))
Using cached coverage-4.2-cp27-cp27m-macosx_10_10_x86_64.whl
Saved ./coverage-4.2-cp27-cp27m-macosx_10_10_x86_64.whl
Skipping nose, due to already being wheel.
Skipping django-coverage, due to already being wheel.
Skipping coverage, due to already being wheel.
Installing from wheels is what pip already does by default whenever a wheel is available. pip wheel is for creating wheels from your requirements file, not for installing them.