Conversion of HTML to PDF using wkhtmltopdf failed - python

For setup I ran:
sudo apt-get install wkhtmltopdf
pip install pdfkit==0.6.1
Now, I am trying to run the following code on a VM in the cloud:
import pdfkit
pdfkit.from_file("foo.html", "foo.pdf", options={"javascript-delay": 10000})
The javascript-delay argument is necessary because otherwise some parts are not rendered correctly. This command works fine on my local machine, but in the cloud I get the following error message:
wkhtmltopdf exited with non-zero code 1. error:
QStandardPaths: XDG_RUNTIME_DIR not set, defaulting to '/tmp/runtime-root'
qt.qpa.screen: QXcbConnection: Could not connect to display
Could not connect to any X display.
Any idea how to fix this error, or an alternative way of converting an .html file to a .pdf?

After a lot of trial and error, this is what I added to my Dockerfile to make it work:
RUN apt-get update && apt-get install -yq gdebi
RUN TEMP_DEB="$(mktemp).deb" \
&& wget -O "$TEMP_DEB" 'https://github.com/wkhtmltopdf/packaging/releases/download/0.12.1.4-2/wkhtmltox_0.12.1.4-2.bionic_amd64.deb' \
&& sudo apt install -yqf "$TEMP_DEB" \
&& rm -f "$TEMP_DEB"
In short, I install gdebi and then a different version of wkhtmltox (0.12.1.4-2, from the wkhtmltopdf packaging releases).
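To confirm inside the VM or container that the failure really comes from the missing X display, a small check like the following can help. It is only a sketch using the standard library; the function name headless_report is made up for illustration, and it reports on the environment rather than fixing anything.

```python
import os
import shutil


def headless_report(env=None, which=shutil.which):
    """Summarize the things the wkhtmltopdf error message points at."""
    env = os.environ if env is None else env
    return {
        # Path of the wkhtmltopdf binary, or None if it is not on PATH.
        "wkhtmltopdf": which("wkhtmltopdf"),
        # An empty or missing DISPLAY means no X server is reachable.
        "has_display": bool(env.get("DISPLAY")),
        # Qt falls back to a default directory when this is unset.
        "xdg_runtime_dir": env.get("XDG_RUNTIME_DIR"),
    }


if __name__ == "__main__":
    print(headless_report())
```

If has_display is False and the installed wkhtmltopdf build needs an X server, the "Could not connect to any X display" error above is the expected outcome.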


Automatically install pyodbc on a Databricks cluster upon each restart

I have been using pyodbc on one of my Databricks clusters and have been installing it using this shell command running in the first cell of my notebook:
curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add -
curl https://packages.microsoft.com/config/ubuntu/16.04/prod.list > /etc/apt/sources.list.d/mssql-release.list
apt-get update
ACCEPT_EULA=Y apt-get install msodbcsql17
apt-get -y install unixodbc-dev
sudo apt-get install python3-pip -y
pip3 install --upgrade pyodbc
This works, but I have to run it every time I start the cluster and want to use pyodbc, so I have been putting it in the first cell of every notebook that uses pyodbc. To avoid that, I saved the commands as a .sh file, uploaded it to DBFS, and added it as one of my cluster's init scripts. However, when I then run the code below:
cnxn1 = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};SERVER='+jdbcHostname+';DATABASE='+jdbcDatabase+';UID='+username1+';PWD='+ password1)
I get the following error:
('01000', "[01000] [unixODBC][Driver Manager]Can't open lib 'ODBC Driver 17 for SQL Server' : file not found (0) (SQLDriverConnect)")
What am I doing wrong with my shell commands or init script that causes this issue? Any help would be greatly appreciated. Thanks!
This is the recommended way of doing it.
Create the file like this:
dbutils.fs.put("dbfs:/databricks/scripts/pyodbc-install.sh","""
curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add -
curl https://packages.microsoft.com/config/ubuntu/16.04/prod.list > /etc/apt/sources.list.d/mssql-release.list
apt-get update
# -y is needed because init scripts run non-interactively and cannot answer the apt prompt
ACCEPT_EULA=Y apt-get install -y msodbcsql17
apt-get -y install unixodbc-dev
sudo apt-get install python3-pip -y
pip3 install --upgrade pyodbc""", True)
Then go to your cluster configuration page.
Click on Edit.
Scroll down and expand Advanced Options > Init Scripts.
There you can add the path of the script (dbfs:/databricks/scripts/pyodbc-install.sh).
Then you can click on Confirm.
Now, this script will be executed at the start of your cluster and will make pyodbc available on all notebooks attached to it.
Is that how you did it?
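To check from a notebook whether the init script actually registered the driver, one option is to read the unixODBC driver registry, which is typically /etc/odbcinst.ini. The sketch below parses such a file with the standard library. The function name registered_odbc_drivers and the sample content, including the library path, are illustrative assumptions and not output from a real cluster.

```python
import configparser


def registered_odbc_drivers(ini_text):
    """Return {driver name: shared library path} from odbcinst.ini text."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    # configparser lowercases option names, so "Driver=" is read as "driver".
    return {name: parser[name].get("driver", "") for name in parser.sections()}


# Illustrative content only; the library path is a made-up placeholder.
SAMPLE = """
[ODBC Driver 17 for SQL Server]
Description=Microsoft ODBC Driver 17 for SQL Server
Driver=/path/to/libmsodbcsql-17.so
"""

drivers = registered_odbc_drivers(SAMPLE)
print("ODBC Driver 17 for SQL Server" in drivers)
```

On the cluster you would pass the real file contents instead, for example open("/etc/odbcinst.ini").read(). If the name used in DRIVER={...} is missing from the result, the "Can't open lib" error from the question is what unixODBC reports.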

How to solve Tesseract "Failed loading language 'eng'" problem in a Docker image

I recently got the following error:
File "/usr/local/lib/python3.8/site-packages/pytesseract/pytesseract.py", line 287, in run_and_get_output
run_tesseract(**kwargs)
File "/usr/local/lib/python3.8/site-packages/pytesseract/pytesseract.py", line 263, in run_tesseract
raise TesseractError(proc.returncode, get_errors(error_string))
pytesseract.pytesseract.TesseractError: (1, "read_params_file: Can't open ][|~_}{=!#%&«§><:;—?¢°*#, Failed loading language 'eng' Tesseract couldn't load any languages! Could not initialize tesseract.")
I have a python file where I specify the pytesseract location:
pytesseract.pytesseract.tesseract_cmd = r"to/my/path"
I also included tesseract and pytesseract in my requirements, and I install tesseract-ocr in the Dockerfile.
I do not understand why this happens. Can anyone help?
I also copied my tesseract-ocr folder into the image in the Dockerfile:
COPY tesseract-ocr .
Edit:
Below are my requirements:
opencv-python==4.5.1.48
openpyxl==3.0.6
packaging==20.8
pandas==1.2.1
pathlib==1.0.1
patsy==0.5.1
pdfminer.six==20200517
pdfplumber==0.5.25
Pillow==8.1.0
prov==2.0.0
pycryptodome==3.9.9
pydot==1.4.1
PyMuPDF==1.16.14
pyparsing==2.4.7
PyPDF2==1.26.0
pytesseract==0.3.7
tesseract
Below is my Dockerfile:
FROM python:3.8.7-slim
WORKDIR /usr/src/app
ARG src_folder= "folder/"
ARG src_ocr= "Tesseract-OCR/"
COPY ${src_folder} .
COPY ${src_ocr} .
COPY requirements.txt .
# Install all the required dependencies
RUN apt-get update \
&& apt-get install -y \
build-essential \
cmake \
git \
wget \
unzip \
yasm \
pkg-config \
libswscale-dev \
libtbb2 \
libtbb-dev \
libjpeg-dev \
libpng-dev \
libtiff-dev \
libavformat-dev \
libpq-dev \
&& rm -rf /var/lib/apt/lists/*
RUN apt-get --fix-missing update && apt-get --fix-broken install && apt-get install -y poppler-utils && apt-get install -y tesseract-ocr && \
apt-get install -y libtesseract-dev && apt-get install -y libleptonica-dev && ldconfig && apt install -y libsm6 libxext6 && apt install -y python-opencv
RUN pip install -r requirements.txt
# command to run on container start
CMD [ "python", "./folder/main.py" ]
You have two problems here...
The primary problem is a strange one. The apt-get package tesseract-ocr-eng is installed as a transitive dependency of one of the other packages you install with apt-get:
# apt-get install tesseract-ocr-eng
...
tesseract-ocr-eng is already the newest version (1:4.00~git30-7274cfa-1).
and that package installs an English trained data file in the right place:
# ls -l /usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata
-rw-r--r-- 1 root root 4113088 Sep 15 2017 /usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata
and yet you get this error message suggesting that Tesseract can't find this file.
After some searching and trying several different things that got Tesseract working, this is the most concise fix I found. Add this line near the end of your Dockerfile, just before the final CMD line that sets the command to be executed:
RUN wget https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata -O /usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata 2> /dev/null
This command replaces the previously installed eng.traineddata file with the one from the tesseract-ocr/tessdata repository on GitHub. It is much larger than the previously installed file:
# ls -l /usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata
-rw-r--r-- 1 root root 23466654 Feb 14 20:17 /usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata
By replacing the previously installed eng.traineddata file with this new version, your code starts to run fine. I didn't have your image data, obviously, so I had to change your code a bit to use my own image for testing. When I supplied an image with some text in it, I got back the text as the result of calling pytesseract.image_to_string. So this one fix should be all you need.
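Before and after applying the fix, it can be useful to confirm from Python which eng.traineddata file is present and how large it is. This is a minimal sketch. The helper name find_traineddata is hypothetical, and the directory is the one shown in the listings above, which may differ on other images.

```python
import os


def find_traineddata(lang, search_dirs):
    """Return (path, size_in_bytes) of the first <lang>.traineddata found, else None."""
    for directory in search_dirs:
        candidate = os.path.join(directory, lang + ".traineddata")
        if os.path.isfile(candidate):
            return candidate, os.path.getsize(candidate)
    return None


# Directory taken from the ls output above.
print(find_traineddata("eng", ["/usr/share/tesseract-ocr/4.00/tessdata"]))
```

A result of None means the file is missing from that directory, and a changed size after the rebuild shows that the replacement took effect.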
There is a second problem here. Your pytesseract.image_to_string call is being garbled somehow by the fact that you're breaking it across multiple lines. To fix just this one issue, you can edit the call so that the string constant is all on one line:
infor = pytesseract.image_to_string(im,
lang="eng",
config='--dpi 300 --psm 6 --oem 2 -c tessedit_char_blacklist=][|~_}{=!#%&«§><:;—?¢°*#,')
When I made just this change, the "Can't open ..." part of the error message went away. If you fix only that, you're left with this error message:
pytesseract.pytesseract.TesseractError: (1, "Failed loading language 'eng' Tesseract couldn't load any languages! Could not initialize tesseract.")
It's interesting that if you apply just the first fix, both problems go away, as you don't get an error message at all. I don't know what's up with that.
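One way to keep the config string from being split in the source is to build it from parts, so the code can wrap across lines while the string itself stays on one logical line. The sketch below uses a shortened, illustrative blacklist rather than the full one from the call above. shlex.split appears only to show how a shell-like tokenizer sees the result; that pytesseract tokenizes the config this way is an assumption here.

```python
import shlex

# Shortened, illustrative blacklist; the original call uses a longer one.
BLACKLIST = "][|~_}{=!#%&"

# Join the flags with single spaces so no newline ends up inside the string.
config = " ".join([
    "--dpi 300",
    "--psm 6",
    "--oem 2",
    "-c tessedit_char_blacklist=" + BLACKLIST,
])

# Show how a shell-like splitter would break the string into arguments.
print(shlex.split(config))
```

The blacklist stays attached to tessedit_char_blacklist= as a single argument, which is what the -c option needs.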
I believe that I've given you all that you need to know. If you have additional problems, let us know. If you want me to share my versions of your Dockerfile and main.py files, I can do that.
Happy Tesseracting!
PS: I would recommend that you move the installation calls in your Dockerfile, the calls to apt-get and pip, to the top of the file. This way, you can modify the parts of your Dockerfile specific to your application later on in the file, and your image build will happen quickly rather than all of the long installation steps having to be done again. This is an important practice to understand when building Docker images. It will save you a ton of time watching long Docker image builds over and over again. I did this right away when working on your problem, and I could rebuild and run the next version of the Docker image in just a few seconds rather than it taking more than a minute to rebuild and run each new image.

Problem in Fabric when running sudo apt update

I'm trying to write a script that installs Python on my VPSs automatically using the Fabric library, but it gives me an error when I run this sudo command:
server.sudo("apt -y update")
and it says bash: sudo: command not found.
I hope I made the problem clear.
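The message suggests that sudo is not installed on the remote host. One possible workaround, sketched below, is to probe for sudo first and fall back to a plain run. This assumes the login user is already root when sudo is absent, which the question does not state. run_privileged is a hypothetical helper, and conn stands for a Fabric Connection such as the server object above.

```python
def run_privileged(conn, command):
    """Run `command` with sudo if the remote host has it, otherwise run it directly."""
    # `command -v sudo` exits non-zero when sudo is missing; warn=True stops that
    # from raising, and hide=True keeps the probe output off the terminal.
    probe = conn.run("command -v sudo", warn=True, hide=True)
    if probe.ok:
        return conn.sudo(command)
    # Without sudo this only succeeds if the login user is already root.
    return conn.run(command)
```

Usage would look like run_privileged(server, "apt -y update").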

Using pyunpack.Archive on Cloud Run with rar file results in empty result

I've been using pyunpack.Archive().extractall(tempdir) successfully in a Cloud Run instance to extract .tar.gz and .zip files to a tempfile.TemporaryDirectory, but when I try the same approach with .rar files, I just get an empty temporary directory.
The strange thing is that the code works when run locally (On Ubuntu 20.04).
I have been wondering whether it has something to do with the system installation of the Linux rar/unrar binaries. I only managed to install unrar-free using Docker. When trying to install unrar or rar, I get "No installation candidate" errors, despite adding the Multiverse PPA.
There is no error output when extracting the rar file; it simply produces no files. I have also checked the integrity of the rar file.
Solved it by adding the non-free repositories and installing both unrar and unrar-free:
RUN apt-get update && \
apt-get install -y software-properties-common && \
apt-get update && \
#apt-get install -y python-software-properties && \
#apt-get update && \
apt-add-repository non-free
RUN apt-get update && \
apt-get install -y \
unrar-free \
unrar
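Because the extraction fails silently, it can help to check at startup which rar extractors are on PATH and to verify afterwards that the target directory is not empty. The following is a sketch with hypothetical helper names. The binary names are assumptions based on the package names in the Dockerfile above.

```python
import os
import shutil

# Binary names are assumptions based on the package names in the Dockerfile above.
CANDIDATES = ("unrar", "unrar-free", "rar")


def available_rar_tools(which=shutil.which):
    """Map each candidate extractor name to its path, skipping ones not on PATH."""
    found = {}
    for name in CANDIDATES:
        path = which(name)
        if path:
            found[name] = path
    return found


def extracted_anything(directory):
    """True if the extraction target contains at least one entry."""
    return bool(os.listdir(directory))


print(available_rar_tools())
```

Calling extracted_anything(tempdir) right after extractall(tempdir) turns the silent empty result into something the code can detect and report.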

Python in docker container

I am new to Docker so I am sorry for such an easy question.
I am building a Docker container on top of an image that is itself based on the ubuntu:vivid image.
When executing my script within the container I am getting an error:
exec: "python": executable file not found in $PATH
How can I solve this?
When I try to install Python with apt-get in my Dockerfile:
FROM my_image # based on ubuntu:vivid
RUN apt-get update && \
apt-get install -y python3
ENV PATH /:$PATH
COPY file.py /
CMD ["python", "file.py", "-h"]
I get:
WARNING: The following packages cannot be authenticated!
libexpat1 libffi6 libmagic1 libmpdec2 libssl1.0.0 libpython3.4-minimal
mime-support libsqlite3-0 libpython3.4-stdlib python3.4-minimal
python3-minimal python3.4 libpython3-stdlib dh-python python3 file
E: There are problems and -y was used without --force-yes
The command '/bin/sh -c apt-get update && apt-get install -y python3' returned a non-zero code: 100
make: *** [image] Error 1
EDIT: added Dockerfile content
A similar issue occurs on some Linux distributions; see "Why am I getting authentication errors for packages from an Ubuntu repository?"
In all cases, the usual sequence of commands to install new packages is:
RUN apt-get update -yq && apt-get install -yqq \
git \
python \
...
The OP Ela reports in the comments:
RUN apt-get update -y && apt-get install -y --force-yes \
git \
python \
...
You install python3 but then invoke the python executable. I had the same issue and resolved it by using python3.
Try changing the last line of your Dockerfile. Instead of:
CMD ["python", "file.py", "-h"]
try:
CMD ["python3", "file.py", "-h"]
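To see which interpreter name actually exists in an image, a short check can be run inside the container. This is a sketch; pick_interpreter is a hypothetical helper that returns the first name found on PATH.

```python
import shutil


def pick_interpreter(which=shutil.which, names=("python", "python3")):
    """Return the first interpreter name that resolves on PATH, or None."""
    for name in names:
        if which(name):
            return name
    return None


print(pick_interpreter())
```

On an image where only the python3 package was installed, this would be expected to return python3, which matches the CMD change suggested above.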
