Hey, I'd like to install NLTK's pos_tag on my Heroku server. How can I do so? Please give me the steps, as I'm new to the Heroku server system.
I just added official nltk support to the buildpack!
Simply add an nltk.txt file with a list of the corpora you want installed, and everything should work as expected.
Update
As Kenneth Reitz pointed out, a much simpler solution has been added to the heroku-python-buildpack. Add a nltk.txt file to your root directory and list your corpora inside. See https://devcenter.heroku.com/articles/python-nltk for details.
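For the pos_tag question specifically, a minimal nltk.txt might look like this (a hedged example: averaged_perceptron_tagger is the model pos_tag loads by default, and punkt is only needed if you also tokenize with word_tokenize):
punkt
averaged_perceptron_tagger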
Original Answer
Here's a solution that allows you to install the NLTK data directly on Heroku without adding it to your git repo.
I used similar steps to install TextBlob on Heroku, which uses NLTK as a dependency. I've made some minor adjustments to my original code in steps 3 and 4 that should work for an NLTK-only installation.
The default Heroku buildpack includes a post_compile step that runs after all of the default build steps have been completed:
# post_compile
#!/usr/bin/env bash
if [ -f bin/post_compile ]; then
    echo "-----> Running post-compile hook"
    chmod +x bin/post_compile
    sub-env bin/post_compile
fi
As you can see, it looks in your project's bin directory for your own post_compile file, and runs it if it exists. You can use this hook to install the NLTK data.
Create the bin directory in the root of your local project.
Add your own post_compile file to the bin directory.
# bin/post_compile
#!/usr/bin/env bash
if [ -f bin/install_nltk_data ]; then
    echo "-----> Running install_nltk_data"
    chmod +x bin/install_nltk_data
    bin/install_nltk_data
fi
echo "-----> Post-compile done"
Add your own install_nltk_data file to the bin directory.
# bin/install_nltk_data
#!/usr/bin/env bash
source $BIN_DIR/utils
echo "-----> Starting nltk data installation"
# Assumes NLTK_DATA environment variable is already set
# $ heroku config:set NLTK_DATA='/app/nltk_data'
# Install the nltk data
# NOTE: The following command installs the averaged_perceptron_tagger model,
# so you may want to change it for your specific needs.
# See http://www.nltk.org/data.html
python -m nltk.downloader averaged_perceptron_tagger
# If using Textblob, use this instead:
# python -m textblob.download_corpora lite
# Open the NLTK_DATA directory
cd ${NLTK_DATA}
# Delete all of the zip files
find . -name "*.zip" -type f -delete
echo "-----> Finished nltk data installation"
Add nltk to your requirements.txt file (or textblob if you are using TextBlob).
Commit all of these changes to your repo.
Set the NLTK_DATA environment variable on your Heroku app.
$ heroku config:set NLTK_DATA='/app/nltk_data'
Deploy to Heroku. You will see the post_compile step trigger at the end of the deployment, followed by the nltk download.
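Once deployed, you can sanity-check the tagger from a one-off dyno. A hedged example (it assumes the averaged_perceptron_tagger download above succeeded; the pre-split input avoids also needing punkt):
$ heroku run python -c "import nltk; print(nltk.pos_tag('Heroku runs NLTK'.split()))"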
I hope you found this helpful! Enjoy!
If you want to use simple functionality like pos_tag, tokenizers, stemming, etc., then do the following steps:
Mention nltk in requirements.txt
Mention the following modules in nltk.txt:
wordnet
pros_cons
reuters
hmm_treebank_pos_tagger
maxent_treebank_pos_tagger
universal_tagset
punkt
averaged_perceptron_tagger_ru
averaged_perceptron_tagger
snowball_data
rslp
porter_test
vader_lexicon
treebank
dependency_treebank
You need to follow the steps below.
nltk.txt needs to be present in the root folder.
Add the modules you want to download, like punkt and stopwords, as separate row items.
Change the line endings from Windows to Unix.
Changing the line endings is a very important step. It can easily be done in Sublime Text or Notepad++. In Sublime Text, it can be done from the View menu, then Line Endings.
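If you're already in a Unix-like shell, a hedged one-liner can do the same conversion (the sed variant assumes GNU sed):
dos2unix nltk.txt
# or, without dos2unix installed:
sed -i 's/\r$//' nltk.txt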
Hope this helps
Related
I am using Python with TextBlob for sentiment analysis. I want to deploy my app (built in Plotly Dash) to Google Cloud Run with Google Cloud Build (without using Docker). When running locally in my virtual environment everything works fine, but after deploying to the cloud the corpora are not downloaded. Looking at the requirements.txt file, there was also no reference to these corpora.
I have tried adding python -m textblob.download_corpora to my requirements.txt file, but it doesn't download when I deploy. I have also tried to add
import textblob
import subprocess
cmd = ['python','-m','textblob.download_corpora']
subprocess.run(cmd)
and
import nltk
nltk.download('movie_reviews')
to my script (callbacks.py, I am using Plotly Dash to make my app), all without success.
Is there a way to add this corpus to my requirements.txt file? Or is there another workaround to download this corpus? How can I fix this?
Thanks in advance!
Vijay
Since Cloud Run creates and destroys containers as needed for your traffic levels, you'll want to embed your corpora in the pre-built container to ensure a fast cold-start time (instead of downloading it when the container starts).
The easiest way to do this is to add another line to your Dockerfile that downloads and installs the corpora at build time, like so:
RUN python -m textblob.download_corpora
Here's a full Dockerfile for your reference:
# Python image to use.
FROM python:3.8
# Set the working directory to /app
WORKDIR /app
# copy the requirements file used for dependencies
COPY requirements.txt .
# Install any needed packages specified in requirements.txt
RUN pip install --trusted-host pypi.python.org -r requirements.txt
RUN python -m textblob.download_corpora
# Copy the rest of the working directory contents into the container at /app
COPY . .
# Run app.py when the container launches
ENTRYPOINT ["python", "app.py"]
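For reference, a hedged sketch of building and deploying this image to Cloud Run from the project root (PROJECT_ID and the service name myapp are placeholders):
gcloud builds submit --tag gcr.io/PROJECT_ID/myapp
gcloud run deploy myapp --image gcr.io/PROJECT_ID/myapp --platform managed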
Good luck,
Josh
[update, I found the solution, see answer below]
I made a GUI wrapper for protonvpn, a command-line program for Linux. dpkg -b gets me ProtonVPNgui.deb, which works fine. However, I have problems using debuild -S -sa to upload it to Launchpad.
As is, it won't build once uploaded with dput, cf. the error msg
I tried using debuild -i -us -uc -b to build a .deb file for local testing, but it returns:
dpkg-genchanges: error: binary build with no binary artifacts found; cannot distribute
Any ideas? This whole process is driving me nuts. (I use this tar.gz)
I figured it out myself. Here's how to create a .deb package locally for testing and upload the project to Launchpad:
Create a launchpad user account.
Install dh-python with the package manager
Create the package source dir
mkdir myscript-0.1
Copy your python3 script(s) (or the sample script below) to the source dir (don't use #!/usr/bin/python as the shebang; use #!/usr/bin/python3 or #!/usr/bin/python2, and edit accordingly below)
cp ~/myscript myscript-0.1
cd myscript-0.1
Sample script:
#!/usr/bin/python3
if __name__ == '__main__':
    print("Hello world")
Create the packaging skeleton (debian/*)
dh_make -s --createorig
Remove the example files
rm debian/*.ex debian/*.EX debian/README.*
Add any binary files to include, e.g. gettext .mo files:
mkdir -p debian/source
echo debian/locales/es/LC_MESSAGES/base.mo > debian/source/include-binaries
Edit debian/control
Replace its content with the following text:
Source: myscript
Section: utils
Priority: optional
Maintainer: Name,
Build-Depends: debhelper (>= 9), python3, dh-python
Standards-Version: 4.1.4
X-Python3-Version: >= 3.2
Package: myscript
Architecture: all
Depends: ${misc:Depends}, ${python3:Depends}
Description: insert up to 60 chars description
 insert long description, indented with spaces
debian/install must contain the script(s) to install (one or several; Python, Perl, etc., plus any .desktop files for start-menu shortcuts) as well as the target directories, each on one line:
echo myscript usr/bin > debian/install
Edit debian/rules
Replace its content with the following text:
#!/usr/bin/make -f
%:
	dh $@ --with=python3
Note: it's a TAB before dh $@, not four spaces!
Build the .deb package
debuild -us -uc
You will get a few Lintian warnings/errors but your package is ready to be used:
../myscript_0.1-1_all.deb
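Before preparing the upload, you can sanity-check the package locally (a hedged example; myscript is the sample script installed to usr/bin by debian/install):
sudo dpkg -i ../myscript_0.1-1_all.deb
myscript   # should print "Hello world"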
Prepare the upload to Launchpad, inserting your GPG fingerprint after -k:
debuild -S -sa -k12345ABC
Upload to Launchpad
dput ppa:[your ppa name]/ppa myscript_0.1-1_source.changes
This is an update to askubuntu.com/399552. It may take some error messages and googling till you're ready... Cf. the ...orig.tar.gz file at Launchpad for the complete project.
I am having trouble creating a Lambda layer for the xgboost library. I'm running:
I'm grabbing a zip of xgboost and its dependencies from here (https://github.com/alexeybutyrev/aws_lambda_xgboost) and loading it into a layer. When I try to test my Lambda, I get this error:
Unable to import module 'lambda_function': No module named 'xgboost.core'
It looks like __init__.py is trying to reference core.py via from .core import <stuff>
Has anyone encountered this error with AWS Lambda before?
EDIT: As @Marcin has remarked, the first answer provided works for packages under 262 MB.
A. Python Packages within Lambda Layer size limit
You can also do it with the AWS SAM CLI and Docker (see this link to install the SAM CLI) to build the packages inside a container. Basically, you initialize a default template with Python as the runtime, and then you specify the packages in the requirements.txt file. I found it easier than the article you mentioned. Here are the steps, in case you want to consider them for future use.
1. Initialize a default SAM template
Under any folder where you want to keep the project, type
sam init
This will prompt a series of questions; for a quick setup, choose the Quick Start Templates as follows:
1 - AWS Quick Start Templates
2 - Python 3.8
Project name [sam-app]: your_project_name
1 - Hello World Example
By choosing the Hello World Example, it generates a default Lambda function with a requirements.txt file. Now we're going to edit that file with the name of the package we want, in this case xgboost.
2. Specify packages to install
cd your_project_name
code hello_world/requirements.txt
As I have Visual Studio Code as my editor, this will open the file in it. Now I can specify the xgboost package:
xgboost
Here comes the reason to have Docker installed. Some packages rely on C++, so it is recommended to build inside a container (especially on Windows). Now, move to the folder where the template.yaml file is located and type
sam build -u
3. Zip packages
There are some files that you do not want included in your Lambda layer, because we only want to keep the Python libraries. Thus, you can remove the following files:
rm .aws-sam/build/HelloWorldFunction/app.py
rm .aws-sam/build/HelloWorldFunction/__init__.py
rm .aws-sam/build/HelloWorldFunction/requirements.txt
and then zip the remaining content of the folder.
cp -r .aws-sam/build/HelloWorldFunction/ python/
zip -r my_layer.zip python/
where we place the libraries in a python/ folder, according to the docs.
On Windows systems, the zip command should be replaced with:
Compress-Archive my_layer/ my_layer.zip.
4. Upload your Layer to AWS
On AWS go to Lambda, then choose Layers and Create Layer. Now, you can upload your .zip file as the image below shows
Notice that for zip files over 50 MB, you should upload the .zip file to an S3 bucket and provide the path, for example https://s3.amazonaws.com/mybucket/my_layer.zip.
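As an alternative to the console, a hedged sketch of publishing the layer with the AWS CLI (the layer name xgboost-layer is a placeholder; for zips over 50 MB, point --content at an S3 bucket instead of using --zip-file):
aws lambda publish-layer-version \
    --layer-name xgboost-layer \
    --zip-file fileb://my_layer.zip \
    --compatible-runtimes python3.8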
B. Python packages that exceed the Lambda Layer limits
The xgboost package on its own is more than 300 MB and will throw the following error
As @Marcin has kindly pointed out, the prior approach with the SAM CLI would not directly work for Python layers that exceed the limit. There's an open issue on GitHub about specifying a custom Docker image when running sam build -u, and a possible solution of retagging the default lambda/lambci image.
So, how can we get around this? There are already some useful resources that I would just point to.
First, the Medium article that @Alex took as his solution, which follows this repo's code.
Second, the alexeybutyrev approach, which works by applying the strip command to reduce the library sizes. One can find this approach in a GitHub repo, where the instructions are provided.
Edit (December 2020)
This month AWS released container image support for AWS Lambda. Assuming the following tree structure for your project:
Project/
|-- app/
| |-- app.py
| |-- requirements.txt
| |-- xgb_trained.bin
|-- Dockerfile
You can deploy an XGBoost model with the following Docker image. Follow this repo's instructions for a detailed explanation.
# Dockerfile based on https://docs.aws.amazon.com/lambda/latest/dg/images-create.html
# Define global args
ARG FUNCTION_DIR="/function"
ARG RUNTIME_VERSION="3.6"
# Choose buster image
FROM python:${RUNTIME_VERSION}-buster as base-image
# Install aws-lambda-cpp build dependencies
RUN apt-get update && \
apt-get install -y \
g++ \
make \
cmake \
unzip \
libcurl4-openssl-dev \
git
# Include global arg in this stage of the build
ARG FUNCTION_DIR
# Create function directory
RUN mkdir -p ${FUNCTION_DIR}
# Copy function code
COPY app/* ${FUNCTION_DIR}/
# Install python dependencies and runtime interface client
RUN python${RUNTIME_VERSION} -m pip install \
--target ${FUNCTION_DIR} \
--no-cache-dir \
awslambdaric \
-r ${FUNCTION_DIR}/requirements.txt
# Install xgboost from source
RUN git clone --recursive https://github.com/dmlc/xgboost
RUN cd xgboost; make -j4; cd python-package; python${RUNTIME_VERSION} setup.py install; cd;
# Multi-stage build: grab a fresh copy of the base image
FROM base-image
# Include global arg in this stage of the build
ARG FUNCTION_DIR
# Set working directory to function root directory
WORKDIR ${FUNCTION_DIR}
# Copy in the build image dependencies
COPY --from=base-image ${FUNCTION_DIR} ${FUNCTION_DIR}
ENTRYPOINT [ "/usr/local/bin/python", "-m", "awslambdaric" ]
CMD [ "app.handler" ]
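To get this image in front of Lambda, a hedged sketch of the build-and-push steps (the account ID 123456789012, region us-east-1, and repository name are placeholders):
docker build -t xgboost-lambda .
aws ecr create-repository --repository-name xgboost-lambda
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
docker tag xgboost-lambda:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/xgboost-lambda:latest
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/xgboost-lambda:latest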
So I was never able to figure out why it failed in this way. The solution I found that worked was to create an EC2 instance running Amazon Linux, install and zip the libraries there, then save them to S3. See here for detailed instructions:
https://medium.com/@lucashenriquessilva/how-to-create-a-aws-lambda-python-layer-db2830e08b12
I wish to add more Python modules to my Yocto/OpenEmbedded project, but I am unsure how to. I wish to add Flask and its dependencies.
Some Python packages have corresponding recipes in the meta folders, like the Enum class, for example:
meta-openembedded/meta-python/recipes-devtools/python/python-enum34_1.1.6.bb
Unfortunately, lots of useful packages aren't available, and some might be needed for your Python application. Are you used to installing missing packages with pip on the already-booted platform? But what if the target product is not connected to an IP network? Then the solution is to implement a new recipe and add it to the platform meta layer (at least). Here is an example recipe for the module keyboard, useful for intercepting key/button touch events:
Use the PyPI web site to check whether the package is available:
https://pypi.org/project/keyboard/
Download the archive available on the package description page:
https://github.com/boppreh/keyboard/archive/master.zip
Collect some useful information required to fill out a new recipe:
SUMMARY - can be obtained from the package description page
HOMEPAGE - the project URL on GitHub, Bitbucket, SourceForge, etc.
LICENSE - verify the license type
LIC_FILES_CHKSUM - obtained by executing md5sum on an existing LICENSE, README, or PKG-INFO file located in the root of the package (preferably)
SRC_URI[md5sum] - the md5sum of the archive itself; it will be used to discover and download the archive from the PyPI server automatically, with the help of the supporting class (inherit pypi)
PYPI_PACKAGE_EXT - if the package is not a tar.gz, you need to supply the correct extension
Create the missing python-keyboard_0.13.1.bb recipe:
SUMMARY = "Hook and simulate keyboard events on Windows and Linux"
HOMEPAGE = "https://github.com/boppreh/keyboard"
LICENSE = "BSD-3-Clause"
LIC_FILES_CHKSUM = "file://PKG-INFO;md5=9bc8ba91101e2f378a65d36f675c88b7"
SRC_URI[md5sum] = "d4b90e53bbde888e7b7a5a95fe580a30"
SRC_URI += "file://add_missing_CHANGES_md.patch"
PYPI_PACKAGE = "keyboard"
PYPI_PACKAGE_EXT = "zip"
inherit pypi
inherit setuptools
BBCLASSEXTEND = "native nativesdk"
The package has been patched by adding the
SRC_URI += "file://add_missing_CHANGES_md.patch"
directive to the recipe, due to a missing CHANGES.md file that the setup.py script uses to identify the package version (this step is optional). The patch itself has to be placed inside a folder next to the recipe, matching the recipe name but without the version:
python-keyboard
This question is old, but currently in 2020 there is a Python package called pipoe.
pipoe can generate .bb recipes corresponding to Python packages for you!
Usage:
$ pip3 install pipoe
$ pipoe -p requests
OR
$ pipoe -p requests --python python3
Now copy the generated .bb files to your layer and use them.
https://pypi.org/project/pipoe/
The OE layer index at layers.openembedded.org lists all known layers and the recipes they contain, so searching there should bring up the meta-python layer, which you can add to your build and use recipes from.
In your Image recipe you can add a Python module by adding it to the IMAGE_INSTALL variable:
IMAGE_INSTALL += "python-numpy"
You can find possible modules, for example, by searching for them with wildcards:
find . -name "*python*numpy*bb"
Running this in the Yocto folder brings up:
./poky/meta/recipes-devtools/python/python-numpy_1.7.0.bb
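Since the question mentions Flask: before writing your own recipe, it's worth checking whether one already exists in your configured layers. A hedged example (the recipe may be named python-flask or python3-flask depending on the release):
bitbake-layers show-recipes "*flask*"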
pipoe did not work for me either, so I ended up making this bash script. Someone else might find it useful.
You will need to change this in my script below:
local my_layers_dir="my/layers/directory"
To run this script:
./pypi.sh <modulename>
#example:
./pypi.sh humanfriendly #this should generate the bb file for the humanfriendly python module
pypi.sh:
#!/bin/bash
set -ex

function argstovars()
{
    # parse key=value arguments into shell variables
    for change in "$@"; do
        set -- `echo $change | tr '=' ' '`
        eval $1=$2
    done
}

function main(){
    local module=""
    argstovars "$@"

    local my_layers_dir="my/layers/directory"
    local url_files="https://pypi.org/project/$module/#files"

    mkdir -p /tmp/pypi
    rm -fr /tmp/pypi/*
    pushd /tmp/pypi
    wget $url_files
    # scrape the sdist URL from the PyPI files page
    local targz_url=$(cat index.html | grep https://files | grep tar.gz | sed -r "s/<a href=\"(.*)\">/\1/g")
    wget $targz_url
    local targz_file=$(ls | grep tar.gz)
    local md5=$(md5sum $targz_file)
    md5=${md5%% *}
    local sha256=$(sha256sum $targz_file)
    sha256=${sha256%% *}
    tar -xf $targz_file
    local module_with_version=$(echo "$targz_file" | sed -r "s/(.*)\.tar\.gz/\1/g")
    pushd $module_with_version
    local license_file=$(find . -name "LICENSE*")
    local md5lic=$(md5sum $license_file)
    md5lic=${md5lic%% *}
    popd
    popd

    module_with_version="${module_with_version//-/_}"

    mkdir -p "$my_layers_dir/$module"
    pushd "$my_layers_dir/$module"
    # \${PN} is escaped so that the literal ${PN} ends up in the generated recipe
    echo "SUMMARY = \"This is a python module for $module\"
HOMEPAGE = \"https://pypi.org/project/$module/\"
LICENSE = \"MIT\"
LIC_FILES_CHKSUM = \"file://$license_file;md5=$md5lic\"
SRC_URI[md5sum] = \"$md5\"
SRC_URI[sha256sum] = \"$sha256\"
PYPI_PACKAGE = \"$module\"

inherit pypi setuptools3

RDEPENDS_\${PN} += \" \
python3-psutil \
\"
" > "${module_with_version}.bb"
    popd
}

time main module=$@
I usually install Python packages through pip.
For Google App Engine, I need to install packages to another target directory.
I've tried:
pip install -I flask-restful --target ./lib
but it fails with:
must supply either home or prefix/exec-prefix -- not both
How can I get this to work?
Are you using OS X and Homebrew? The Homebrew Python page https://github.com/Homebrew/brew/blob/master/docs/Homebrew-and-Python.md calls out a known issue with pip and a workaround.
Worked for me.
You can make this "empty prefix" the default by adding a
~/.pydistutils.cfg file with the following contents:
[install]
prefix=
Edit: The Homebrew page was later changed to recommend passing --prefix on the command line, as discussed in the comments below. Here is the last version which contained that text. Unfortunately this only works for sdists, not wheels.
The issue was reported to pip, which later fixed it for --user. That's probably why the section has now been removed from the Homebrew page. However, the problem still occurs when using --target as in the question above.
I believe there is a simpler solution to this problem (Homebrew's Python on macOS) that won't break your normal pip operations.
All you have to do is create a setup.cfg file in the root directory of your project, usually where your main __init__.py or executable .py file is. So if the root folder of your project is /path/to/my/project/, create a setup.cfg file there and put the magic words inside:
[install]
prefix=
OK, now you should be able to run pip's commands for that folder:
pip install package -t /path/to/my/project/
This command will run gracefully for that folder only. Just copy setup.cfg to whatever other projects you might have. No need to write a .pydistutils.cfg on your home directory.
After you are done installing the modules, you may remove setup.cfg.
On OS X (Mac), assuming a project folder called /var/myproject:
cd /var/myproject
Create a file called setup.cfg and add
[install]
prefix=
Run pip install <packagename> -t .
Another solution* for Homebrew users is simply to use a virtualenv.
Of course, that may remove the need for the target directory anyway - but even if it doesn't, I've found --target works by default (as in, without creating/modifying a config file) when in a virtual environment.
*I say solution; perhaps it's just another motivation to meticulously use venvs...
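A hedged sketch of that route, using the standard venv module with the command from the question:
python3 -m venv venv
source venv/bin/activate
pip install -I flask-restful --target ./lib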
I hit errors with the other recommendations around --install-option="--prefix=lib". The only thing I found that worked is using PYTHONUSERBASE as described here.
export PYTHONUSERBASE=lib
pip install -I flask-restful --user
This is not exactly the same as --target, but it does the trick for me in any case.
As others mentioned, this is a known bug with pip & Python installed via Homebrew.
If you create a ~/.pydistutils.cfg file with the "empty prefix" instruction, it will fix this problem, but it will break normal pip operations.
Until this bug is officially addressed, one of the options would be to create your own bash script that would handle this case:
#!/bin/bash

name=''
target=''

while getopts 'n:t:' flag; do
    case "${flag}" in
        n) name="${OPTARG}" ;;
        t) target="${OPTARG}" ;;
    esac
done

if [ -z "$target" ]; then
    echo "Target parameter must be provided"
    exit 1
fi

if [ -z "$name" ]; then
    echo "Name parameter must be provided"
    exit 1
fi

# current workaround for homebrew bug
file=$HOME'/.pydistutils.cfg'
touch $file
/bin/cat <<EOM >$file
[install]
prefix=
EOM
# end of current workaround for homebrew bug

pip install -I $name --target $target

# current workaround for homebrew bug
rm -rf $file
# end of current workaround for homebrew bug
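For example, if you save the wrapper as pip_target.sh (a hypothetical name) and make it executable, usage looks like:
chmod +x pip_target.sh
./pip_target.sh -n flask-restful -t ./lib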
This script wraps your command and:
accepts name and target parameters
checks if those parameters are empty
creates ~/.pydistutils.cfg file with "empty prefix" instruction in it
executes your pip command with provided parameters
removes ~/.pydistutils.cfg file
This script can be changed and adapted to address your needs, but you get the idea. It allows you to run your command without breaking pip. Hope it helps :)
If you're using virtualenv*, it might be a good idea to double check which pip you're using.
If you see something like /usr/local/bin/pip you've broken out of your environment. Reactivating your virtualenv will fix this:
VirtualEnv: $ source bin/activate
VirtualFish: $ vf activate [environ]
*: I use virtualfish, but I assume this tip is relevant to both.
I had a similar issue.
I use the --system flag to avoid the error, as I describe here in another thread where I explain the specific case of my situation.
I post this here hoping it can help anyone facing the same problem.