Tika-OCR python testing in Gitlab CI/CD

I am testing functionality that uses Tika-OCR for Python. According to the documentation, Tika also requires Java 8. The test cases pass locally, because my machine has Java 8 and Python 3.6 installed. But when I run the unit tests on GitLab, they fail with the error "Unable to run Java, is it installed?" How do I use both Python and Java images in the yml file?
I tried to use two images in my yml file, one for Java and one for Python, but only the last one in the sequence is loaded. Below is my .gitlab-ci.yml file.
image: java:8
image: python:3.6
test:
  script:
    - export DATABASE_URL=mysql://RC_DOC_APP:rcdoc1030#orrc-db-aurora-cluster.cluster-cxwsh0fkj4mo.us-east-1.rds.amazonaws.com/RC_DOC
    - apt-get update -qy
    - pip install --upgrade pip
    - apt-get install -y python-dev python-pip
    - pip install -U setuptools wheel
    - pip install -r requirements.txt
    - python -m nltk.downloader stopwords
    - python -m unittest test.test_classification
Here, only python:3.6 is loaded and not java, because it is the last image entry when the file is processed in sequence. The requirements file includes tika-ocr, which is installed with pip. My test case is run by the last line of the script, and that is where the error occurs.
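To see whether the CI container actually provides a Java runtime before the tests start, a small preflight check can help. This is a minimal sketch, not part of the original question: the function name java_available is hypothetical, and it only assumes that a working runtime answers "java -version" with exit code 0.

```python
import shutil
import subprocess


def java_available(executable="java"):
    """Return True if `executable` is on PATH and `-version` exits cleanly."""
    path = shutil.which(executable)
    if path is None:
        return False  # nothing by that name on PATH
    try:
        result = subprocess.run(
            [path, "-version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False  # found, but could not be executed in time
    return result.returncode == 0
```

One possible use is to decorate the Tika-dependent test class with unittest.skipUnless(java_available(), "Java not found"), so that a missing runtime shows up as an explicit skip reason rather than as an error deep inside the test.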

Related

Python on Gitlab has ModuleNotFound, But Not When I Run Locally

The following snapshot shows the file structure:
When I run on Gitlab CI, here is what I am seeing:
Why is this error occurring when Gitlab runs it but not when I run locally?
Here is my .gitlab-ci.yml file.
Note that this had been working before.
I recently made win_perf_counters a Git submodule instead of being an actual subdirectory. (Again, it works locally.)
test:
  before_script:
    - python -V
    - pip install virtualenv
    - virtualenv venv
    - .\venv\Scripts\activate.ps1
    - refreshenv
  script:
    - python -V
    - echo "******* installing pip ***********"
    - python -m pip install --upgrade pip
    - echo "******* installing locust ********"
    - python -m pip install locust
    - locust -V
    - python -m pip install multipledispatch
    - python -m pip install pycryptodome
    - python -m pip install pandas
    - python -m pip install wmi
    - python -m pip install pywin32
    - python -m pip install influxdb_client
    - set LOAD_TEST_CONF=load_test.conf
    - echo "**** about to run locust ******"
    - locust -f ./src/main.py --host $TARGET_HOST -u $USERS -t $TEST_DURATION -r $RAMPUP -s 1800 --headless --csv=./LoadTestsData_VPOS --csv-full-history --html=./LoadTestsReport_VPOS.html --stream-file ./data/stream_jsons/streams_vpos.json --database=csv
    - Start-Sleep -s $SLEEP_TIME
  variables:
    LOAD_TEST_CONF: load_test.conf
    PYTHON_VERSION: 3.8.0
    TARGET_HOST: http://10.10.10.184:9000
  tags:
    - win2019
  artifacts:
    paths:
      - ./LoadTests*
      - public
  only:
    - schedules
  after_script:
    - ls src -r
    - mkdir .public
    - cp -r ./LoadTests* .public
    - cp metrics.csv .public -ErrorAction SilentlyContinue
    - mv .public public
When I tried with changing the Gitlab CI file to use requirements.txt:

The Python libraries in your local environment are probably not the same versions as the ones installed on GitLab. Run pip list or pip freeze on your local machine to see which versions you have there, then pip install those versions in your GitLab script. A good practice is to keep a requirements.txt or setup.py file with pinned versions rather than pulling the latest versions every time.
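The comparison can also be scripted. The following is a rough sketch under stated assumptions: check_pins is a hypothetical helper, it needs Python 3.8 or newer for importlib.metadata, and it only understands plain "name==version" lines (extras, environment markers and other specifiers are skipped).

```python
from importlib import metadata


def check_pins(requirement_lines):
    """Return (name, wanted, found) for each pin the environment does not satisfy.

    `found` is None when the package is not installed at all.
    """
    problems = []
    for raw in requirement_lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or "==" not in line:
            continue  # only exact pins are checked in this sketch
        name, wanted = (part.strip() for part in line.split("==", 1))
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != wanted:
            problems.append((name, wanted, found))
    return problems
```

Running it both locally and as a CI step, for example with the lines of requirements.txt as input, shows which packages differ between the two environments.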

The module you are developing probably lacks an __init__.py file and therefore cannot be found when it is imported from outside its own directory.
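One way to check this guess is to list the directories that contain Python files but no __init__.py. This is an illustrative sketch only: dirs_missing_init is a hypothetical helper, the skipped directory names are assumptions, and a missing __init__.py is not always an error, because Python 3 also supports namespace packages.

```python
import os


def dirs_missing_init(root):
    """Return directories under `root` that hold .py files but no __init__.py."""
    missing = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not descend into hidden folders, virtualenvs or bytecode caches.
        dirnames[:] = [
            d for d in dirnames
            if not d.startswith(".") and d not in ("venv", "__pycache__")
        ]
        has_python = any(name.endswith(".py") for name in filenames)
        if has_python and "__init__.py" not in filenames:
            missing.append(dirpath)
    return sorted(missing)
```

Running it in the CI job as well as locally also reveals whether a directory exists at all in the runner's checkout, since a directory that is absent simply never appears in the walk.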

Inconsistent results with sphinxcontrib-spelling (Windows vs. Ubuntu)

I am using Sphinx with sphinxcontrib-spelling to spell-check my documentation.
I get different results when running the spell check locally (Windows 10) and running it on my CI/CD GitLab machine (Ubuntu).
Both are running the same version of python + libraries.
My pipeline is as follows:
- sudo apt-get install software-properties-common -y
- sudo add-apt-repository ppa:deadsnakes/ppa -y
- sudo apt-get update -y
- sudo apt-get install python3.8 python3.8-dev python3.8-venv python3-pip python3-venv python-enchant -y
- python3.8 -mvenv ../myenv
- source ../myenv/bin/activate
- python3.8 -mpip install --upgrade pip
- python3.8 -mpip install -U sphinx sphinx-sitemap sphinx-last-updated-by-git sphinxcontrib-bibtex sphinxcontrib-spelling
- python3.8 -mpip install -U --force-reinstall sphinx-aimms-theme
- python3.8 -msphinx -W --keep-going -b spelling . _build/spelling
My conf.py file has a language configuration to avoid any differences between machines:
spelling_lang='en_US'
The spell check on the Linux machine seems to be more reliable than the local one: my current branch has 0 misspelled words locally and 32 on CI/CD (including some obvious mistakes that are not detected locally).
I tried running both on python3.10 and updating all libraries, but I get the same results. I do use a custom word list to ignore certain words, but none of the 32 misspelled are on this list.
Is there any limit to checking locally? Is it ignoring certain files? I am at a loss as to why there is a difference.
My branch isn't available yet, but the same behaviour can be found in the Master branch of the original repository which can be found here: https://github.com/aimms/documentation
I tested with commit 7c2103843ead3bdaa728ae56eab968c3f147f520
CI/CD Linux build:
WARNING: Found 1292 misspelled words
build finished with problems, 1 warning.
Local Windows build:
WARNING: Found 1608 misspelled words
build finished with problems, 40 warnings.

How to install local python packages when building jobs under Github Actions?

I am building a python project -- potion. I want to use Github actions to automate some linting & testing before merging a new branch to master.
To do that, I am using a slight modification of a Github recommended python actions starter workflow -- Python Application.
During the step of "Install dependencies" within the job, I am getting an error. This is because pip is trying to install my local package potion and failing.
The code that is failing is: if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
The corresponding error is:
ERROR: git+https#github.com:<github_username>/potion.git#82210990ac6190306ab1183d5e5b9962545f7714#egg=potion is not a valid editable requirement. It should either be a path to a local project or a VCS URL (beginning with bzr+http, bzr+https, bzr+ssh, bzr+sftp, bzr+ftp, bzr+lp, bzr+file, git+http, git+https, git+ssh, git+git, git+file, hg+file, hg+http, hg+https, hg+ssh, hg+static-http, svn+ssh, svn+http, svn+https, svn+svn, svn+file).
Error: Process completed with exit code 1.
Most likely, the job is not able to install the package potion because it cannot find it. I installed it on my own computer using pip install -e . and later used pip freeze > requirements.txt to create the requirements file.
Since I use this package for testing, I need to install it so that pytest can run its tests properly.
How can I install a local package (which is under active development) on Github Actions?
Here is part of the Github workflow file python-app.yml
...
steps:
- uses: actions/checkout@v2
- name: Set up Python 3.8
  uses: actions/setup-python@v2
  with:
    python-version: 3.8
- name: Install dependencies
  run: |
    python -m pip install --upgrade pip
    pip install flake8 pytest
    if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
- name: Lint with flake8
...
Note 1: I have already tried changing from git+git#github.com:<github_username>... to git_git#github.com/<github_username>.... Pay attention to / instead of :.
Note 2: I have also tried using other protocols such as git+https, git+ssh, etc.
Note 3: I have also tried to remove the alphanumeric #8221... after git url ...potion.git

The "package under test", potion in your case, should not be part of the requirements.txt. Instead, simply add your line
pip install -e .
after the line with pip install -r requirements.txt in it. That installs the already checked out package in development mode and makes it available locally for an import.
Alternatively, you could put that line at the latest needed point, i.e. right before you run pytest.
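If requirements.txt is regenerated with pip freeze, the editable entry for the package under test will reappear each time. A small filter can strip it again. This is a sketch under assumptions: drop_package_under_test is a hypothetical helper, and it assumes the entry looks like the usual pip freeze output for an editable install, that is, a line starting with "-e " and ending with "#egg=<package>".

```python
def drop_package_under_test(lines, package="potion"):
    """Return `lines` without the editable entry for `package`."""
    kept = []
    for line in lines:
        stripped = line.strip()
        is_editable = stripped.startswith("-e ")
        names_package = stripped.lower().endswith("#egg=" + package.lower())
        if is_editable and names_package:
            continue  # this is the package under test, install it with `pip install -e .`
        kept.append(line)
    return kept
```

The remaining lines can be written back to requirements.txt, and the workflow then installs the package itself with pip install -e . as described above.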

Python project can't find modules inside of Docker

I'm trying to run a python project inside of docker using the following Dockerfile for machine learning purposes:
FROM python:3
RUN apt-get update \
    && apt-get install -yq --no-install-recommends \
    python3 \
    python3-pip
RUN pip3 install --upgrade pip==9.0.3 \
    && pip3 install setuptools
# for flask web server
EXPOSE 8081
# set working directory
ADD . /app
WORKDIR /app
# install required libraries
COPY requirements.txt ./
RUN pip3 install -r requirements.txt
# This is the runtime command for the container
CMD python3 app.py
And here is my requirements file:
flask
scikit-learn[alldeps]
pandas
textblob
numpy
matplotlib[alldeps]
But when I try to import textblob and pandas, I get a "No module named 'X'" error in my docker cmd.
| warnings.warn(msg, category=FutureWarning)
| Traceback (most recent call last):
| File "app/app.py", line 12, in <module>
| from textblob import Textblob
| ImportError: No module named 'textblob'
exited with code 1
Folder structure
machinelearning:
  backend:
    app.py
    Dockerfile
    requirements.txt
  frontend:
    ... (frontend works fine.)
  docker-compose.yml
Does anyone know the solution to this problem?
(I'm fairly new to Docker, so I might just be missing something crucial.)

This worked for me
FROM python:3
RUN apt-get update
RUN apt-get install -y --no-install-recommends
# for flask web server
EXPOSE 8081
# set working directory
WORKDIR /app
# install required libraries
COPY requirements.txt .
RUN pip install -r requirements.txt
# copy source code into working directory
COPY . /app
# This is the runtime command for the container
CMD python3 app.py
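When an image still reports a missing module, it can help to ask the interpreter that actually runs app.py which modules it can see. This is a minimal diagnostic sketch, not part of the answer above: missing_modules is a hypothetical helper, and it is meant for top-level module names only.

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the top-level module names this interpreter cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Example use inside the container (module names taken from requirements.txt):
#   print(sys.executable)
#   print(missing_modules(["flask", "pandas", "textblob"]))
```

If sys.executable points somewhere other than the interpreter that pip installed into, the packages were installed for a different Python than the one running the application.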

On Linux, whenever you see the message:
ImportError: No module named 'XYZ'
check whether the module or its dependencies can be installed with apt-get. The example below does not work for textblob, but the same approach may help with other modules:
(This does not work here; it only illustrates what often helps.)
# Python3:
sudo apt-get install python3-textblob
# Python2:
sudo apt-get install python-textblob
See Python error "ImportError: No module named" or How to solve cannot import name 'abort' from 'werkzeug.exceptions' error while importing Flask.
In the case of "textblob", this does not work for Python 2.7. I did not test it on Python 3, where it will likely not work either, but in such cases it is worth a try.
There is no need to guess, though. To check whether any Python packages for "textblob" exist, search the apt cache with a regular expression:
$ apt-cache search "python.*blob"
libapache-directory-jdbm-java - ApacheDS JDBM Implementation
python-git-doc - Python library to interact with Git repositories - docs
python-swagger-spec-validator-doc - Validation of Swagger specifications (Documentation)
python3-azure-storage - Microsoft Azure Storage Library for Python 3.x
python3-bdsf - Python Blob Detection and Source Finder
python3-binwalk - Python3 library for analyzing binary blobs and executable code
python3-discogs-client - Python module to access the Discogs API
python3-git - Python library to interact with Git repositories - Python 3.x
python3-mnemonic - Implementation of Bitcoin BIP-0039 (Python 3)
python3-nosehtmloutput - plugin to produce test results in html - Python 3.x
python3-swagger-spec-validator - Validation of Swagger specifications (Python3 version)
python3-types-toml - Typing stubs for toml
python3-types-typed-ast - Typing stubs for typed-ast
None of these results is a textblob package.

Docker image with python3, chromedriver, chrome & selenium

My objective is to scrape the web with Selenium driven by Python from a docker container.
I've looked around but have not found a docker image with all of the following installed:
Python 3
ChromeDriver
Chrome
Selenium
Is anyone able to link me to a docker image with all of these installed and working together?
Perhaps building my own isn't as difficult as I think, but it has eluded me thus far.
Any and all advice appreciated.

Try https://github.com/SeleniumHQ/docker-selenium.
It has python installed:
$ docker run selenium/standalone-chrome python3 --version
Python 3.5.2
The instructions indicate you start it with
docker run -d -p 4444:4444 --shm-size=2g selenium/standalone-chrome
Edit:
To allow selenium to run through python it appears you need to install the packages. Create this Dockerfile:
FROM selenium/standalone-chrome
USER root
RUN wget https://bootstrap.pypa.io/get-pip.py
RUN python3 get-pip.py
RUN python3 -m pip install selenium
Then you could run it with
docker build . -t selenium-chrome && \
docker run -it selenium-chrome python3
The advantage compared to the plain python docker image is that you won't need to install the chromedriver itself since it comes from selenium/standalone-chrome.
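Inside such a container, a scraping script could start Chrome roughly as follows. This is a sketch under assumptions rather than a tested recipe for this image: the helper names are hypothetical, the Chrome flags are the ones commonly used for headless runs as root in containers, and it assumes a selenium release that accepts the options keyword and finds chromedriver on PATH.

```python
def chrome_arguments(headless=True):
    """Return Chrome command-line flags commonly used inside containers."""
    args = ["--no-sandbox", "--disable-dev-shm-usage"]
    if headless:
        args.insert(0, "--headless")
    return args


def make_driver(headless=True):
    """Create a Chrome WebDriver; selenium is imported lazily on purpose."""
    from selenium import webdriver  # only needed once a driver is requested

    options = webdriver.ChromeOptions()
    for argument in chrome_arguments(headless):
        options.add_argument(argument)
    return webdriver.Chrome(options=options)


# Example use:
#   driver = make_driver()
#   driver.get("https://example.com")
#   print(driver.title)
#   driver.quit()
```

The --disable-dev-shm-usage flag is an alternative to enlarging shared memory with --shm-size on docker run; which of the two is needed depends on the pages being scraped.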

I like Harald's solution.
However, as of 2021, my environment needed some modifications.
Docker version 20.10.5, build 55c4c88
I changed the Dockerfile as follows.
FROM selenium/standalone-chrome
USER root
RUN apt-get update && apt-get install python3-distutils -y
RUN wget https://bootstrap.pypa.io/get-pip.py
RUN python3 get-pip.py
RUN python3 -m pip install selenium

https://hub.docker.com/r/joyzoursky/python-chromedriver/
It uses python3 as the base image and installs chromedriver, chrome and selenium (as a pip package) at build time. I used the alpine-based python3 version myself, as the image size is smaller.
$ cd [your working directory]
$ docker run -it -v $(pwd):/usr/workspace joyzoursky/python-chromedriver:3.6-alpine3.7-selenium sh
/ # cd /usr/workspace
See if the images suit your case. You can pip install selenium together with other packages from a requirements.txt file to build your own image, or use the Dockerfiles of this project as a reference.
If you want to pip install more packages apart from selenium, you could build your own image as in this example:
First, in your working directory, you may have a requirements.txt storing the package versions you want to install:
selenium==3.8.0
requests==2.18.4
urllib3==1.22
... (your list of packages)
Then create the Dockerfile in the same directory like this:
FROM joyzoursky/python-chromedriver:3.6-alpine3.7
RUN mkdir packages
ADD requirements.txt packages
RUN pip install -r packages/requirements.txt
Then build the image:
docker build -t yourimage .
This differs from the official selenium image in that selenium is installed as a pip package on a python base image. However, it is hosted by an individual, so there may be a higher risk that maintenance stops.
