Using Tesseract on Heroku with Django


I would like to add OCR capabilities to my Django app running on Heroku. I suspect the easiest way is to use Tesseract. I've noticed that there are a number of Python wrappers for Tesseract's API, but what is the best way to get Tesseract installed and running on Heroku? Via a custom buildpack like heroku-buildpack-tesseract, maybe?

I'll try to capture some notes on the solution I arrived at here.
My .buildpacks file:
https://github.com/heroku/heroku-buildpack-python
https://github.com/clearideas/heroku-buildpack-ghostscript
https://github.com/marcolinux/heroku-buildpack-libraries
My .buildpacks_bin_download file:
tesseract-ocr https://s3.amazonaws.com/tesseract-ocr/heroku/tesseract-ocr-3.02.02.tar.gz 3.02 eng,spa
Here is the key piece of python that does the OCRing of pdf files:
# Additional processing
document_path = Path(str(document.attachment_file))
if document_path.ext == '.pdf':
    working_path = Path('temp', document.directory)
    working_path.mkdir(parents=True)
    input_path = Path(working_path, name)
    input_path.write_file(document.attachment_file.read(), 'w')
    rb = ReadBot()
    args = [
        'VBEZ',
        # '-sDEVICE=tiffg4',
        '-sDEVICE=pnggray',
        '-dNOPAUSE',
        '-r600x600',
        '-sOutputFile=' + str(working_path) + '/page-%00d.png',
        str(input_path)
    ]
    ghostscript.Ghostscript(*args)
    image_paths = working_path.listdir(pattern='*.png')
    txt = ''
    for image_path in image_paths:
        ocrtext = rb.interpret(str(image_path))
        txt = txt + ocrtext
    document.notes = txt
    document.save()
    working_path.rmtree()
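The Ghostscript argument list above can be factored into a small helper. This is a minimal sketch rather than the answer's original code: `build_gs_args` and its parameters are hypothetical names, and the zero-padded `page-%03d.png` pattern is an assumption chosen so that the page images sort in page order.

```python
from pathlib import Path


def build_gs_args(input_pdf, output_dir, dpi=600):
    """Return Ghostscript arguments that render each PDF page to a grayscale PNG."""
    # Assumption: zero-padded page numbers so that a sorted listing is in page order.
    output_pattern = Path(output_dir) / "page-%03d.png"
    return [
        "gs",  # stands in for the program name, like 'VBEZ' in the snippet above
        "-sDEVICE=pnggray",
        "-dNOPAUSE",
        f"-r{dpi}x{dpi}",
        f"-sOutputFile={output_pattern}",
        str(input_pdf),
    ]
```

With the wrapper used above, the call would then presumably look like `ghostscript.Ghostscript(*build_gs_args(input_path, working_path))`.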

Heroku, Django and tesseract
This doc will walk you through setting up Tesseract on Heroku (I'm using Django).
Steps
1) Add heroku-apt-buildpack using the command:
This is the stable version. See the source repository
$ heroku buildpacks:add --index 1 heroku-community/apt
2) Add Aptfile to project directory
$ touch Aptfile
3) Add the following to the Aptfile
tesseract-ocr-eng is the English language file for tesseract.
tesseract-ocr
tesseract-ocr-eng
4) Get path to the data downloaded by the tesseract-ocr-eng package
We will use this path for the next step
$ heroku run bash
$ find -iname tessdata # this will give us the path we need
You can now leave the Heroku shell by typing exit.
5) Set a Heroku config variable named TESSDATA_PREFIX to the path returned by the find -iname tessdata command above
$ heroku config:set TESSDATA_PREFIX=./.apt/usr/share/tesseract-ocr/4.00/tessdata
6) Push changes to heroku
$ git push heroku master
I hope this helps. Let me know if it works for you.
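Steps 4 and 5 can also be checked from Python, for example in a one-off dyno. The following is a rough sketch under stated assumptions: `find_tessdata_dirs` and `has_language` are hypothetical helper names, and the check relies on Tesseract language files being named `<lang>.traineddata`.

```python
import os


def find_tessdata_dirs(root="."):
    """Return every directory under root named 'tessdata', ignoring case."""
    matches = []
    for dirpath, dirnames, _filenames in os.walk(root):
        for dirname in dirnames:
            if dirname.lower() == "tessdata":
                matches.append(os.path.join(dirpath, dirname))
    return sorted(matches)


def has_language(tessdata_dir, lang="eng"):
    """Check that the trained data file for lang exists in tessdata_dir."""
    return os.path.isfile(os.path.join(tessdata_dir, f"{lang}.traineddata"))
```

Calling `has_language(os.environ["TESSDATA_PREFIX"])` at application start-up is one way to notice a wrong path before the first OCR request fails.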

Related

Dockerfile WORKDIR distracts running program from layer?

I made a Dockerfile to build a Docker image that can run on AWS Batch. It contains multiple layers and copies files to '/opt', which I set as WORKDIR.
I have to run a program called 'BLAST', a single executable that requires several parameters, including the location of the DB.
When I run the image, it fails because it cannot find the mounted DB location. The full error message is b'BLAST Database error: No alias or index file found for nucleotide database [/mnt/fsx/ntdb/nt] in search path [/opt:/fsx/ntdb:]\n'] where /mnt/fsx/ntdb/nt is the DB path.
My only guess is that, because I set WORKDIR in my Dockerfile, the default workspace is set to '/opt:'.
How should I fix this issue? By removing WORKDIR, or something else?
My Dockerfile is shown below:
# Set Work dir
ARG FUNCTION_DIR="/opt"
# Get layers
FROM (aws-account).dkr.ecr.(aws-region).amazonaws.com/uclust AS layer_1
FROM (aws-account).dkr.ecr.(aws-region).amazonaws.com/blast AS layer_2
FROM public.ecr.aws/lambda/python:3.9
# Copy arg and set work dir
ARG FUNCTION_DIR
COPY . ${FUNCTION_DIR}
WORKDIR ${FUNCTION_DIR}
# Copy from layers
COPY --from=layer_1 /opt/ .
RUN true
COPY --from=layer_2 /opt/ .
RUN true
COPY . ${FUNCTION_DIR}/
RUN true
# Copy and Install required libraries
COPY requirements.txt .
RUN true
RUN pip3 install -r requirements.txt
# To run lambda handler
RUN pip install \
    --target "${FUNCTION_DIR}" \
    awslambdaric
# To run blast
RUN yum -y install libgomp
# See files inside image
RUN dir -s
# Get permissions for files
RUN chmod +x /opt/main.py
RUN chmod +x /opt/mode/submit/main.py
# Set Entrypoint and CMD
ENTRYPOINT [ "python3" ]
CMD [ "-m", "awslambdaric", "main.lambda_handler" ]
Edit: Further info I found: according to the error, BLAST searches for the DB in /opt:/fsx/ntdb:, which combines the path set as WORKDIR in the Dockerfile with the BLASTDB path set via os.environ['BLASTDB'].

I figured out the problem after many debugging attempts. The problem was neither WORKDIR nor os.environ['BLASTDB']. The paths were defined correctly, and BLAST searching [/opt:/fsx/ntdb:] is the expected behavior according to the documentation, which lists this search order:
Current working directory (*)
User's HOME directory (*)
Directory specified by the NCBI environment variable
The standard system directory (“/etc” on Unix-like systems, and given by the environment variable SYSTEMROOT on Windows)
The actual solution was to check whether the file system was mounted correctly and whether the files inside it had the right permissions. Initially I thought the file system was mounted correctly, since I had already tested it many times from other Batch jobs, but only the mount folder had been created; the files themselves did not exist. So even though the program looked for the index file, it could not find one, and the error was raised.
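A pre-flight check along these lines would have surfaced the empty mount before BLAST ran. It is only a sketch: `blast_db_problems` is a hypothetical helper, and the default extensions (`.nal` for an alias file, `.nin` for an index file) are an assumption about how nucleotide databases are laid out, so adjust them to the files your database actually contains.

```python
import os


def blast_db_problems(db_prefix, extensions=(".nal", ".nin")):
    """Return a list of problems found for a BLAST database prefix.

    db_prefix is the database path without extension, e.g. /mnt/fsx/ntdb/nt.
    The default extensions are an assumption, not taken from the post.
    """
    db_dir = os.path.dirname(db_prefix)
    if not os.path.isdir(db_dir):
        return [f"directory {db_dir} does not exist (file system not mounted?)"]
    problems = []
    if not os.listdir(db_dir):
        problems.append(f"directory {db_dir} is empty (mount point exists, but no files)")
    existing = [db_prefix + ext for ext in extensions if os.path.isfile(db_prefix + ext)]
    if not existing:
        problems.append(f"no alias or index file found for {db_prefix}")
    for path in existing:
        if not os.access(path, os.R_OK):
            problems.append(f"{path} exists but is not readable")
    return problems
```

An empty list means the checks passed; anything else can be logged before the job calls BLAST.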

sphinx-build command not found in gitlab ci pipeline / python 3 alpine image

I want to specify a GitLab job that creates a sphinx html documentation.
I am using a Python 3 alpine image (cannot specify which exactly).
The build stage within my .gitlab-ci.yml looks like this:
pages:
  stage: build
  tags:
    - buildtag
  script:
    - pip install -U sphinx
    - sphinx-build -b html docs/ public/
  only:
    - master
However, the pipeline fails with sphinx-build: command not found (the same error occurs for make html).
According to the tutorial I followed, my .gitlab-ci.yml should be more or less correct.
What am I doing wrong? Is this issue related to the alpine image I am using?

As @Yasen correctly noted, the path to sphinx-build was not contained in $PATH. However, adding command in front of sphinx-build did not solve the problem for me.
Anyway, I found the solution in the runner logs. The output of pip install -U sphinx contained the following warning:
WARNING: The scripts sphinx-apidoc, sphinx-autogen, sphinx-build and sphinx-quickstart are installed in 'some/path' which is not on PATH.
So I added that directory to PATH in the script step of the .gitlab-ci.yml:
script:
  - pip install -U sphinx
  - export PATH="$PATH:some/path"
  - sphinx-build -b html docs/ public/
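To see whether a script directory is on PATH without reading the pip log, a check like the one below can be run in the job. It is a sketch with hypothetical helper names (`is_on_path`, `prepend_to_path`). `sysconfig.get_path("scripts")` reports the default script directory of the running interpreter, which may differ from the directory pip actually used (for example with a user install), so the pip warning remains the authoritative source.

```python
import os
import sysconfig


def is_on_path(directory, path_value=None):
    """Return True if directory is one of the entries of a PATH-style string."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    wanted = os.path.normpath(directory)
    entries = [entry for entry in path_value.split(os.pathsep) if entry]
    return any(os.path.normpath(entry) == wanted for entry in entries)


def prepend_to_path(directory, path_value):
    """Return a PATH-style string with directory in front, keeping existing entries."""
    if is_on_path(directory, path_value):
        return path_value
    return (directory + os.pathsep + path_value) if path_value else directory


def default_scripts_dir():
    """Directory where the running interpreter installs console scripts by default."""
    return sysconfig.get_path("scripts")
```

Note that `prepend_to_path` keeps the existing entries; replacing PATH outright would hide every other command from the job.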

Did the command pip install -U sphinx succeed? (You should be able to tell that from the CI job log.)
If so, you may need to specify the full path to sphinx-build, as Yasen said.
If it did not succeed, you should troubleshoot the installation of Sphinx.

Most likely the reason is that $PATH doesn't contain the path to sphinx-build.
TL;DR: try using command.
Try this:
pages:
  stage: build
  tags:
    - buildtag
  script:
    - pip install -U sphinx
    - command sphinx-build -b html docs/ public/
  only:
    - master
Explanation
GitLab runners run in a different way
Since GitLab CI uses runners, the runner's shell profile may differ from the one you normally use.
So your runner may be configured without a $PATH entry for the directory that contains sphinx-build.
Zsh/Bash startup files loading order (.bashrc, .zshrc etc.)
See this explanation:
The issue is that Bash sources from a different file based on what kind of shell it thinks it is in. For an “interactive non-login shell”, it reads .bashrc, but for an “interactive login shell” it reads from the first of .bash_profile, .bash_login and .profile (only). There is no sane reason why this should be so; it’s just historical.
What does command mean?
Since we don't know the path where sphinx-build is installed, you can use commands like which, type, etc.
As per this great answer (shell - Why not use "which"? What to use then? - Unix & Linux Stack Exchange), the author recommends using command <name> or $(command -v <name>).
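If the lookup has to happen from Python rather than from the shell, the standard library offers `shutil.which`, which searches PATH (or an explicit search path) in a similar spirit to `command -v`. A minimal sketch, with `resolve_command` and `require_command` as hypothetical wrapper names:

```python
import shutil


def resolve_command(name, search_path=None):
    """Return the full path of an executable, or None if it cannot be found."""
    # With search_path=None, shutil.which falls back to the PATH environment variable.
    return shutil.which(name, path=search_path)


def require_command(name, search_path=None):
    """Like resolve_command, but raise a descriptive error when nothing is found."""
    resolved = resolve_command(name, search_path)
    if resolved is None:
        raise FileNotFoundError(f"{name!r} was not found on the search path")
    return resolved
```

Failing early with `require_command("sphinx-build")` gives a clearer message than a bare "command not found" later in the job.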

Python 3 flask install wkhtmltopdf on heroku

I have a problem installing the wkhtmltopdf binary on my Heroku Python app (Flask).
A year ago (Python 2) I had a similar issue, which I solved by first adding wkhtmltopdf-pack to the requirements, installing it on Heroku, and then setting the config var WKHTMLTOPDF_BINARY=wkhtmltopdf-pack.
Problem now:
I am trying to use the same approach for Python 3, but no version of wkhtmltopdf-pack works; every push gets rejected and I can't install it.
I tried these versions in the requirements:
wkhtmltopdf-pack==0.12.5
wkhtmltopdf-pack==0.12.4
wkhtmltopdf-pack==0.12.3
wkhtmltopdf-pack==0.12.3.0.post1
wkhtmltopdf-pack==0.12.2.4
I get these errors:
No matching distribution
or
error: can't copy 'bin/wkhtmltopdf-pack': doesn't exist or not a regular file
and I remember once it told me there was a SyntaxError and it could not decode something.
Alternative approach:
It seems it is also possible to use a buildpack, so I tried adding a buildpack:
heroku buildpacks:add https://github.com/dscout/wkhtmltopdf-buildpack.git
I see that the buildpack has been added, but nothing was installed and there is no config var for wkhtmltopdf. I don't understand how to trigger the installation; all the buildpack documentation says "add the buildpack and you are ready to go".
Trying to create a PDF gives me a server error here:
OSError: No wkhtmltopdf executable found: "b''"
EDIT:
I managed to install the buildpack:
The push was successful, but no config var has been created and I have no clue what the path to the binary is.
EDIT
I was able to find the files through heroku bash:
app bin dev etc lib lib64 lost+found proc sbin sys tmp usr var
/ $ cd app
~ $ cd vendor
~/vendor $ dir
wkhtmltox
~/vendor $ cd wkhtmltox
~/vendor/wkhtmltox $ dir
lib
~/vendor/wkhtmltox $ cd lib
~/vendor/wkhtmltox/lib $ dir
libwkhtmltox.so libwkhtmltox.so.0 libwkhtmltox.so.0.12 libwkhtmltox.so.0.12.3
~/vendor/wkhtmltox/lib $ exit
I then tried pointing to each of these files, but all of them give an error:
OSError: wkhtmltopdf exited with non-zero code -11. error
Here is how I set the path:
# WKHTMLTOPDF config
if 'DYNO' in os.environ:
    print('loading wkhtmltopdf path on heroku')
    MYDIR = os.path.dirname(__file__)
    WKHTMLTOPDF_CMD = os.path.join(MYDIR + "/vendor/wkhtmltox/lib/", "libwkhtmltox.so")
else:
    print('loading wkhtmltopdf path on localhost')
    MYDIR = os.path.dirname(__file__)
    WKHTMLTOPDF_CMD = os.path.join(MYDIR + "/static/executables/bin/", "wkhtmltopdf.exe")
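A note on the `-11` in the error: on POSIX, Python's `subprocess` reports a child that was killed by signal N as return code -N. Assuming the PDF wrapper passes that return code through unchanged (an assumption, since the post does not show the wrapper's code), the value can be decoded as below. `describe_exit_code` is a hypothetical helper. One plausible reading is that the configured path points at a shared library (`libwkhtmltox.so`) rather than an executable, but the post does not confirm this.

```python
import signal


def describe_exit_code(returncode):
    """Describe a subprocess return code; negative values mean 'killed by a signal'."""
    if returncode == 0:
        return "success"
    if returncode < 0:
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            # Signal number unknown on this platform
            name = f"signal {-returncode}"
        return f"terminated by {name}"
    return f"exited with status {returncode}"
```

On Linux, signal 11 is SIGSEGV, a segmentation fault.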

The best approach to installing wkhtmltopdf on Heroku is to get a wkhtmltopdf binary that works with Python 3 instead of using wkhtmltopdf-pack. You can achieve this with pydf.
You can install it simply using pip like:
pip install python-pdf
or for Python 2:
pip install python-pdf==0.30.0
Unlike the buildpack-based approach, pydf ships with the wkhtmltopdf binary included, which makes it very easy to use, and this is the right approach for Heroku.
But if you still want to stick with a wkhtmltopdf buildpack, here's another solution you can try:
Via: CLI Installation
$ heroku create --buildpack https://github.com/homelight/wkhtmltox-buildpack.git
Or Manually:
Add the following line to your .buildpacks file
https://github.com/homelight/wkhtmltox-buildpack.git
Please note that this buildpack is only compatible with the cedar-14 stack. You can use heroku stack:set cedar-14 to set the correct stack.

I was able to solve the problem on my own, following my first approach.
I found another wkhtmltopdf-pack on PyPI and added it to my requirements.txt:
wkhtmltopdf-pack-ng==0.12.3.0
Heroku was able to install this package.
After that I added the config var for wkhtmltopdf:
heroku config:set WKHTMLTOPDF_BINARY=wkhtmltopdf-pack
The installation is now complete. All that remains is to use the correct path in my app:
if 'DYNO' in os.environ:
    print('loading wkhtmltopdf path on heroku')
    WKHTMLTOPDF_CMD = subprocess.Popen(
        ['which', os.environ.get('WKHTMLTOPDF_BINARY', 'wkhtmltopdf-pack')],  # Note we default to 'wkhtmltopdf-pack' as the binary name
        stdout=subprocess.PIPE).communicate()[0].strip()
else:
    print('loading wkhtmltopdf path on localhost')
    MYDIR = os.path.dirname(__file__)
    WKHTMLTOPDF_CMD = os.path.join(MYDIR + "/static/executables/bin/", "wkhtmltopdf.exe")
That's it.
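The same selection logic can be written without spawning `which`. This is a sketch rather than the answer's code: `wkhtmltopdf_command` and its `lookup` parameter are hypothetical, and the default lookup is `shutil.which`. One practical difference is that `shutil.which` returns a `str`, whereas `communicate()` returns `bytes` on Python 3 unless text mode is requested.

```python
import shutil


def wkhtmltopdf_command(environ, local_fallback, lookup=shutil.which):
    """Pick the wkhtmltopdf executable path for the current environment.

    environ: a mapping such as os.environ.
    local_fallback: path used outside Heroku (for local development).
    lookup: function resolving a command name to a full path, or None.
    """
    if "DYNO" not in environ:
        return local_fallback
    # Same default binary name as in the snippet above
    binary_name = environ.get("WKHTMLTOPDF_BINARY", "wkhtmltopdf-pack")
    resolved = lookup(binary_name)
    if resolved is None:
        raise FileNotFoundError(f"{binary_name} was not found on PATH")
    return resolved
```

Passing `lookup` as a parameter is a design choice that keeps the function testable without the binary installed.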

Python web application with OpenCV in Heroku

I am building a web application that uses OpenCV in its back end. I built the application on Ubuntu (and tried it on Windows, too) and it works fine. Currently, I am trying to configure OpenCV to work on Heroku. Since OpenCV cannot be installed using pip, I read about Heroku buildpacks, which allow customizing the server environment.
The following is my attempt to test two OpenCV buildpacks:
I built a simple web server with Flask that tries to import OpenCV:
# hello.py
import os
from flask import Flask
app = Flask(__name__)
@app.route("/")
def hello():
    text = ''
    try:
        import cv2
        text = 'success'
    except ImportError:
        text = 'fail'
    return text + ' to load openCV'
if __name__ == "__main__":
    port = int(os.environ.get("PORT", 5000))
    app.run(host='0.0.0.0', port=port)
The above code should return either success or fail in loading OpenCV.
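A variant that also reports why the import failed tends to make this kind of test more useful, since a missing shared library and a missing package both show up as a plain "fail" otherwise. A minimal sketch, with `try_import` as a hypothetical helper name:

```python
import importlib


def try_import(module_name):
    """Try to import a module; return (ok, message) where message explains a failure."""
    try:
        importlib.import_module(module_name)
    except Exception as exc:  # ImportError, or anything raised while the module loads
        return False, f"{type(exc).__name__}: {exc}"
    return True, "ok"
```

Returning the message from the Flask route, for example `try_import("cv2")[1]`, would show the underlying error in the browser instead of only "fail".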
Then I configured Heroku to use (heroku multi buildpack) by running the following command:
heroku buildpacks:set https://github.com/ddollar/heroku-buildpack-multi
In the .buildpacks file (that is required by multi buildpack) I put the https://github.com/heroku/heroku-buildpack-python and https://github.com/slobdell/heroku-buildpack-python-opencv-scipy buildpacks.
The first one is for compiling a python application and for installing other modules (e.g., Flask) through pip. The second buildpack is the one that is supposed to load OpenCV.
After all that, the application did not work!
I got an Application Error page on Heroku.
I tried another buildpack (https://github.com/diogojc/heroku-buildpack-python-opencv-scipy) but got the same result.
My questions are:
What is wrong with the steps I followed?
How should I call (or use) OpenCV within my application on Heroku? Should I use an import statement or some other command?

I was able to install it as follows:
cd /path/to/your/dir && git init
heroku create MYAPP (start from scratch)
heroku config:add BUILDPACK_URL=https://github.com/ddollar/heroku-buildpack-multi.git --app MYAPP
create .buildpacks as follows:
https://github.com/heroku/heroku-buildpack-python
https://github.com/diogojc/heroku-buildpack-python-opencv-scipy#cedar14
git add . && git commit -m 'MESSAGE' && git push heroku master

For anyone seeing this today and having the same issue, switch opencv-python in your requirements.txt to opencv-python-headless. This sidesteps the problem with the problematic library file.
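For a requirements file with many entries, the swap can be scripted. The sketch below uses a hypothetical helper name (`use_headless_opencv`) and assumes the package is spelled exactly `opencv-python` in the file; it keeps any version specifier that follows the name and leaves other OpenCV packages untouched.

```python
import re

# Matches "opencv-python" as a whole package name, optionally followed by a specifier.
_OPENCV_LINE = re.compile(r"^(\s*)opencv-python(?![-\w])(.*)$", re.IGNORECASE)


def use_headless_opencv(requirements_text):
    """Return requirements text with opencv-python replaced by opencv-python-headless."""
    lines = []
    for line in requirements_text.splitlines():
        match = _OPENCV_LINE.match(line)
        if match:
            line = f"{match.group(1)}opencv-python-headless{match.group(2)}"
        lines.append(line)
    trailing_newline = "\n" if requirements_text.endswith("\n") else ""
    return "\n".join(lines) + trailing_newline
```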

The following steps should solve the OpenCV problem you are facing:
Add heroku-buildpack-apt to the buildpacks by pasting https://github.com/heroku/heroku-buildpack-apt into the add-buildpack field in the dashboard (Dashboard -> Settings -> Add Buildpack).
Then add an Aptfile to the base directory of your GitHub repository, containing one library per line:
libsm6
libxrender1
libfontconfig1
libice6
Now build and deploy, and you are ready to go!

How to install NLTK modules in Heroku

Hey, I'd like to install the NLTK pos_tag on my Heroku server. How can I do so? Please give me the steps, as I'm new to the Heroku server system.

I just added official nltk support to the buildpack!
Simply add a nltk.txt file with a list of corpora you want installed, and everything should work as expected.

Update
As Kenneth Reitz pointed out, a much simpler solution has been added to the heroku-python-buildpack. Add a nltk.txt file to your root directory and list your corpora inside. See https://devcenter.heroku.com/articles/python-nltk for details.
Original Answer
Here's a solution that allows you to install the NLTK data directly on Heroku without adding it to your git repo.
I used similar steps to install Textblob on Heroku, which uses NLTK as a dependency. I've made some minor adjustments to my original code in steps 3 and 4 that should work for an NLTK only installation.
The default heroku buildpack includes a post_compile step that runs after all of the default build steps have been completed:
# post_compile
#!/usr/bin/env bash
if [ -f bin/post_compile ]; then
    echo "-----> Running post-compile hook"
    chmod +x bin/post_compile
    sub-env bin/post_compile
fi
As you can see, it looks in your project directory for your own post_compile file in the bin directory, and it runs it if it exists. You can use this hook to install the nltk data.
Create the bin directory in the root of your local project.
Add your own post_compile file to the bin directory.
# bin/post_compile
#!/usr/bin/env bash
if [ -f bin/install_nltk_data ]; then
    echo "-----> Running install_nltk_data"
    chmod +x bin/install_nltk_data
    bin/install_nltk_data
fi
echo "-----> Post-compile done"
Add your own install_nltk_data file to the bin directory.
# bin/install_nltk_data
#!/usr/bin/env bash
source $BIN_DIR/utils
echo "-----> Starting nltk data installation"
# Assumes NLTK_DATA environment variable is already set
# $ heroku config:set NLTK_DATA='/app/nltk_data'
# Install the nltk data
# NOTE: The following command installs the averaged_perceptron_tagger corpora,
# so you may want to change for your specific needs.
# See http://www.nltk.org/data.html
python -m nltk.downloader averaged_perceptron_tagger
# If using Textblob, use this instead:
# python -m textblob.download_corpora lite
# Open the NLTK_DATA directory
cd ${NLTK_DATA}
# Delete all of the zip files
find . -name "*.zip" -type f -delete
echo "-----> Finished nltk data installation"
Add nltk to your requirements.txt file (Or textblob if you are using Textblob).
Commit all of these changes to your repo.
Set the NLTK_DATA environment variable on your heroku app.
$ heroku config:set NLTK_DATA='/app/nltk_data'
Deploy to Heroku. You will see the post_compile step trigger at the end of the deployment, followed by the nltk download.
I hope you found this helpful! Enjoy!
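The download step of the script can also be driven from Python. The sketch below only builds the command line. `read_corpora_list` and `build_download_command` are hypothetical helper names, and the input format (one package identifier per line) mirrors the nltk.txt convention mentioned in the update above. The `-d` option of `nltk.downloader` selects the download directory.

```python
import sys


def read_corpora_list(text):
    """Parse nltk.txt-style content: one identifier per line, blank lines ignored."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def build_download_command(corpora, download_dir):
    """Build the command that downloads the given NLTK packages into download_dir."""
    return [sys.executable, "-m", "nltk.downloader", "-d", download_dir, *corpora]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` in an environment where nltk is installed.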

If you want to use simple functionality like pos_tag, tokenizing, stemming, etc., you can do the following:
List nltk in requirements.txt
List the following modules in nltk.txt
wordnet
pros_cons
reuters
hmm_treebank_pos_tagger
maxent_treebank_pos_tagger
universal_tagset
punkt
averaged_perceptron_tagger_ru
averaged_perceptron_tagger
snowball_data
rslp
porter_test
vader_lexicon
treebank
dependency_treebank

You need to follow the steps below.
nltk.txt must be present in the root folder
Add the modules you want to download, such as punkt and stopwords, one per line
Change the line endings from Windows to Unix
Changing the line endings is a very important step. It can easily be done in Sublime Text or Notepad++. In Sublime Text, it can be done from the View menu, then Line Endings.
Hope this helps
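If no suitable editor is at hand, the line endings can be converted with a few lines of Python. A minimal sketch, with `to_unix_line_endings` and `normalize_file` as hypothetical helper names:

```python
from pathlib import Path


def to_unix_line_endings(data: bytes) -> bytes:
    """Convert Windows (CRLF) and bare CR line endings to Unix (LF)."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def normalize_file(path):
    """Rewrite the file in place with LF line endings; return True if it changed."""
    file_path = Path(path)
    original = file_path.read_bytes()
    converted = to_unix_line_endings(original)
    if converted != original:
        file_path.write_bytes(converted)
        return True
    return False
```

Running `normalize_file("nltk.txt")` once before committing should be enough.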
