spaCy and Docker: can't "dockerize" Flask app that uses spaCy modules

I'm trying to install spaCy in my Docker image, but it always fails. At first it was in my requirements.txt file, and it failed right away. Later I tried a separate RUN instruction so pip would install it in isolation, but that failed too.
Here's my Dockerfile content:
FROM python:3.6-alpine
WORKDIR /sentenceSimilarity
ADD . /sentenceSimilarity
RUN pip install -r requirements.txt
RUN pip install -U pip setuptools wheel
RUN pip install -U spacy
RUN python -m spacy download en_core_web_sm
CMD ["python", "app.py"]
I ended up deleting everything from my requirements.txt file except Flask, and the build still fails on the line where spaCy is installed; the only difference now is that it takes a very long time to fail.
From what I can observe, pip iterates through versions from newest to oldest, checking which one might suit, but in the end none of them gets installed.
I've seen others with similar spaCy issues, but no apparent solution.
Can someone suggest an approach I could use to fix this? Thanks in advance.

The "spacy installation extremely slow in docker" GitHub issue explains the problem with the Alpine Python Docker image (I see you have FROM python:3.6-alpine in your Dockerfile):
If you're using an Alpine Linux container, you should probably not: it's a bad choice for Python because none of the PyPi wheels will work, so you'll have to rebuild everything. If you're determined to use Alpine Linux you should host yourself a wheelhouse so that you don't have to rebuild the wheels all the time.
So you need to use another image, e.g. a slim Docker image, as recommended by @mkkeffeler.
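As a minimal sketch (keeping your file layout, swapping the base image to slim, and upgrading the packaging tools before anything is installed):
FROM python:3.6-slim
WORKDIR /sentenceSimilarity
COPY . /sentenceSimilarity
# Upgrade packaging tools first so spaCy can install from prebuilt wheels
RUN pip install -U pip setuptools wheel
RUN pip install -r requirements.txt
RUN pip install -U spacy
RUN python -m spacy download en_core_web_sm
CMD ["python", "app.py"]
On a glibc-based slim image, pip can use the prebuilt wheels from PyPI, so the spaCy install finishes in seconds instead of attempting (and failing) a source build.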

Related

Remove en_core_web_trf .whl file after installing it in spacy

I am using spacy and after installing all the dependencies, I use this command:
spacy download en_core_web_trf
I am running this command while creating a Docker image, and I want to reduce the size of the image. I know that when I install dependencies using the following command:
pip install --no-cache-dir -r requirements.txt
I used the --no-cache-dir option to disable (or clean) pip's cache. Now I am confused about how I can also delete the en_core_web_trf .whl file after installing this dependency, in order to reduce the Docker image size. The en_core_web_trf .whl is almost 450-500 MB, which is why I want to delete it. Can anyone tell me what changes or additional options I have to add to the following command so that the en_core_web_trf .whl is deleted as soon as it is installed?
spacy download en_core_web_trf
The download command installs the en_core_web_trf package via pip, and installed packages end up in site-packages. So all you need to do is find site-packages on your system, e.g. /usr/lib/python3.7/site-packages/en_core_web_trf.
If you aren't using pip's cache the .whl file should not be saved.
The model itself is still large and will take up significant space on disk. Perhaps that's what you're seeing? Those files are necessary to use the model, but if you want to remove them you can pip uninstall en_core_web_trf.
I used this command in the Dockerfile.
RUN pip config set global.no-cache-dir true
After this, the pip cache was disabled and no cache was created while installing Python dependencies. My image size has been reduced significantly.
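Putting both suggestions together, a hedged Dockerfile fragment might look like this (the base image and requirements.txt are assumptions, not from the question):
FROM python:3.9-slim
# Disable pip's cache globally so downloaded .whl files are not kept in any layer
RUN pip config set global.no-cache-dir true
COPY requirements.txt .
RUN pip install -r requirements.txt
# The model wheel is discarded after install; the installed model files
# remain in site-packages, since they are required to use the model
RUN python -m spacy download en_core_web_trf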

How to install a specific python3 version on Alpine?

If I run:
RUN apk add --update --no-cache python3 && ln -sf python3 /usr/bin/python
RUN python3 -m ensurepip
RUN pip3 install --no-cache --upgrade pip setuptools
This installs Python 3.9 on my Alpine image, but because I work with Django 1.10.5 it gives me errors, so I need to install Python 3.5 instead.
How can I specify this?
Package pinning in Alpine might break at any point unless you maintain your own copy of the package repository, because of Alpine's policy regarding old packages:
We don't at the moment have resources to store all built packages indefinitely in our infra. Thus we currently keep only the latest for each stable branch, and has always been like that.
PyPi and npm just keep source versions. We have to do binary builds which are considerably larger, and may need to be redone when one of the dependencies changes. So as whole this is a magnitude more difficult and storage hungry problem. Of course it is unfortunate that the same rules applies to all packages, even to the python/npm ones that would not need to be rebuilt as often as they are source distributions.
There has been discussion of keeping all packages for tagged Alpine releases in the future; however, this is still in progress. The official recommendation is to keep your own mirror/repository with the specific packages and versions you want to use.
Source: https://gitlab.alpinelinux.org/alpine/abuild/-/issues/9996#note_87135
A better way to achieve this is to start from a Python base image right away.
So your Dockerfile would start with:
FROM python:3.5-alpine
## Your `RUN` commands go here, without the need to install Python and pip,
## since they are already built into the base image
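For example, a hypothetical complete Dockerfile for the Django 1.10.5 project above (requirements.txt and manage.py are assumed names from a standard Django layout):
FROM python:3.5-alpine
WORKDIR /app
# Python 3.5 and pip already ship with the base image
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "manage.py", "runserver", "0.0.0.0:8000"]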

Pip installing wrong version on Win7

I have a package that I am trying to install via pip install allen-bradley-toolkit. The install fails: pip tries to install 1.0a1.post0 instead of the latest release version, 2.0.0. Does anyone have any ideas about what to do? Perhaps there is something wrong in my deployment script. You can view the GitHub library here to see how I am deploying to PyPI.
There is an open issue on the GitHub tracker (#2) that you can also reference for more info.
NOTE: The package installs fine on my Win10 machine, but I am unable to get it to install on a Win7 VM.
I've also tried installing with the following commands:
pip install --no-cache-dir allen-bradley-toolkit
pip install allen-bradley-toolkit==2.0.0 -> this one throws a 'doesn't exist' error
At https://pypi.python.org/pypi/allen-bradley-toolkit/2.0.0 I see that the wheel is only available for Python 3. You're trying to install it with Python 2.7.
To publish a universal wheel (suitable for both Py2 and Py3) you need to set
[bdist_wheel]
universal = 1
in setup.cfg or run
python setup.py bdist_wheel --universal
The 2nd line of the output has a clue to the problem - "Using cached ..."
You can skip the cache using the --no-cache-dir option to pip install, or request an upgrade using the -U option.
edit: updated the answer with the correct option (although it seems that wasn't the problem in this specific case).
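For example, combining both flags (the package name is the one from the question):
pip install --no-cache-dir -U allen-bradley-toolkit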

Pip install forked github-repo

I'm working on a project and need slightly different functionality from the package sklearn. I've forked the repo and pushed my changes. I know that I can install from GitHub via pip:
pip install git+git://github.com/wdonahoe/scikit-learn-fork.git#master
and then I can install the package with setup.py:
python setup.py install
However, I am confused about what to do after this step. Running setup.py creates some .egg-info folders and .egg-links files in .../dist-packages/, but I am unsure what to do with them. Ideally, I'd like to go into my project in .../projects/my_project and say something like
from sklearn-my-version import <stuff>
or switch it out with just
from sklearn import <stuff>
I am also a little confused because a lot of resources on this issue mention using easy_install, which I thought pip replaced.
Try again using the -e flag (installing it as an editable git repo lets you git pull updates):
pip install -e git+git://github.com/wdonahoe/scikit-learn-fork.git@master#egg=scikit-learn
more on eggs:
http://mrtopf.de/blog/en/a-small-introduction-to-python-eggs/
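On the import question: the #egg= name is only the distribution name, not the import path, so assuming the fork keeps the package directory name, you would still write from sklearn import <stuff>. A hedged sketch of the update workflow (pip checks editable VCS installs out under ./src/ by default):
pip install -e git+git://github.com/wdonahoe/scikit-learn-fork.git@master#egg=scikit-learn
# later, pull upstream changes into the checkout
cd src/scikit-learn
git pull
# reinstall only needed if compiled (Cython/C) sources changed
pip install -e .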

What's the difference between direct pip install and the requirements.txt?

I'm confused. I've got a working pip install command (meaning: it installs a version of a library from GitHub that works for me), and a non-working way (meaning: it installs a version of the library that does not work for me) of putting that requirement into a requirements.txt file.
More concrete:
If I type on the command line
pip install -e 'git://github.com/mozilla/elasticutils.git#egg=elasticutils'
and then test my program, all works fine. If I put this line into my requirements.txt:
-e git://github.com/mozilla/elasticutils.git#egg=elasticutils
and then run my program, it breaks with an error (only the library should have changed, so I guess something changed in that library between the two versions).
But shouldn't both versions do exactly the same thing? (Of course I did my best to remove the installed version of the library between the two tests, using pip uninstall elasticutils.)
Any information welcome …
Yep, as I wrote in my comment above, there seems to be a dependency override when requirements.txt states something different from the dependencies declared in the package. In my case, installing the package manually also installed the (newer) version of requests, namely 1.2.0, while using requirements.txt always installed (due to the override) version 0.14.2 of requests.
Problem solved by updating the requests version in the requirements.txt :-)
Well, I don't know exactly what the difference is, but when I want something installed from requirements.txt and it's a git repo, I add the following line:
git+https://github.com/user/package_name.git#egg=package_name
and then install as follows:
pip install -r requirements.txt
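One hedged way to stop the two installs from drifting apart is to pin the git requirement to an exact tag or commit in requirements.txt (the ref below is a placeholder):
-e git+https://github.com/mozilla/elasticutils.git@<tag-or-commit>#egg=elasticutils
pip then checks out exactly that revision, so pip install -r requirements.txt installs the same code as the one-off command.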
