I'm trying to convert a large HTML file to PDF. So far I have tried two approaches, both with the same undesired result.
First, I ran this in the terminal:
wkhtmltopdf my_html_file.html my_pdf_file.pdf
Then I tried the conversion inside a Python script:
import pdfkit
with open('my_html_file.html') as f:
    pdfkit.from_file(f, 'my_pdf_file.pdf')
In both cases, the generated output file (my_pdf_file.pdf) has only one page and does not contain all the content from the HTML file.
I'm trying to create a script where I copy and paste information from one URL to another.
I was able to get the HTML code using Playwright and Beautiful Soup, but was unable to format it properly.
For example,
<strong>Autoignition Temperature: </strong>
Using html2text, I was able to convert the above HTML code to
AutoignitionTemp = ** Autoignition Temperature: **
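For reference, a minimal reproduction of that conversion step looks like this (the exact spacing of the ** markers may vary with the html2text version):
import html2text

# html2text renders <strong> as Markdown bold markers (**)
AutoignitionTemp = html2text.html2text('<strong>Autoignition Temperature: </strong>')
print(AutoignitionTemp)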
I then paste that into the correct website using this code:
pastePage.frame_locator(
    "text=Rich Text Editor, txtContent3Editor toolbarsClipboard/Undo Cut Copy Paste Paste >> iframe").locator(
    "body").fill(html2text.html2text(AutoignitionTemp))
However, the text editor does not recognise the ** as bold formatting and pastes the text in verbatim.
Is there a way I can format the text properly, or paste the HTML code directly into the Chrome console?
I'm following a web tutorial, trying to use BeautifulSoup4 to extract data from an HTML file (stored on my local PC) in JupyterLab, as follows:
from bs4 import BeautifulSoup
with open('simple.html') as html_file:
    simple = BeautifulSoup('html_file','lxml')
    print(simple.prettify())
I'm getting the following output, irrespective of what is in the HTML file, instead of the expected HTML:
<html>
 <body>
  <p>
   html_file
  </p>
 </body>
</html>
I've also tried it using the html.parser parser, and I simply get html_file as the output.
I know it can find the file, because when I run the code after removing the file from the directory I get a FileNotFoundError.
It works perfectly well when I run Python interactively from the same directory, and I'm able to run other BeautifulSoup scripts to parse web pages.
I'm using Fedora 32 Linux with Python 3, JupyterLab, BeautifulSoup4, requests, and lxml installed in a virtual environment using pipenv.
Any help to get to the bottom of the problem is welcome.
Your problem is in this line:
simple = BeautifulSoup('html_file','lxml')
In particular, you're telling BeautifulSoup to parse the literal string 'html_file' instead of the contents of the variable html_file.
Changing it to:
simple = BeautifulSoup(html_file,'lxml')
(note the lack of quotes surrounding html_file) should give the desired result.
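Putting it together, the corrected snippet from the question reads:
from bs4 import BeautifulSoup

with open('simple.html') as html_file:
    # Pass the file object, not the string 'html_file'
    simple = BeautifulSoup(html_file, 'lxml')
    print(simple.prettify())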
I am trying to convert thousands of PDF files to HTML. I was able to convert this PDF file to this HTML file using the following code:
import os

def convertPDFToHtml():
    # pdf2txt.py ships with pdfminer; -t html selects HTML output
    command = 'pdf2txt.py -o output.html -t html test.pdf'
    os.system(command)
I want to be able to parse the HTML file so that I can extract different pieces of text from it. The problem now is that the output HTML file is missing a lot of text from the original file.
Is there a better way to convert the PDF files and parse the HTML text?
This is possibly a problem similar to the one discussed here, unless you specifically want to generate HTML files. But even so, you could first extract the text from the PDFs as plain unformatted text, parse it, and then generate the HTML files.
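As a sketch of that plain-text route, assuming the PDFs are processed with pdfminer.six (the package that provides pdf2txt.py), its high-level API can skip the HTML step entirely:
from pdfminer.high_level import extract_text

# Extract the raw text directly, without generating HTML first
text = extract_text('test.pdf')
print(text)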
I am trying to convert an HTML file to pdf using pdfkit python library.
I followed the documentation from here.
Currently, I am trying to convert plain text to PDF rather than a whole HTML document. Everything is working fine, but instead of the text I am seeing boxes in the generated PDF.
This is my code.
import pdfkit
config = pdfkit.configuration(wkhtmltopdf='/usr/local/bin/wkhtmltopdf/wkhtmltox/bin/wkhtmltopdf')
content = 'This is a paragraph which I am trying to convert to pdf.'
pdfkit.from_string(content,'test.pdf',configuration=config)
This is the output: instead of the text 'This is a paragraph which I am trying to convert to pdf.', the converted PDF contains boxes.
Any help is appreciated.
Thank you :)
I was unable to reproduce the issue with Python 2.7 on Ubuntu 16.04; it works fine on the specs mentioned. From my understanding, the problem is that your operating system does not have the font or encoding in which the file is being generated by pdfkit.
Maybe try doing this:
import pdfkit

config = pdfkit.configuration(wkhtmltopdf='/usr/local/bin/wkhtmltopdf/wkhtmltox/bin/wkhtmltopdf')
content = 'This is a paragraph which I am trying to convert to pdf.'
options = {
    'encoding': 'utf-8',
}
pdfkit.from_string(content, 'test.pdf', configuration=config, options=options)
Options to modify the PDF can be passed as a dictionary to the options argument of the from_string function. The list of options can be found here.
This issue is discussed here: Include custom fonts in AWS Lambda.
If you are using pdfkit on Lambda, you will have to set up these environment variables:
"FONT_CONFIG_PATH": '/opt/fonts/'
"FONTCONFIG_FILE": '/opt/fonts/fonts.conf'
If this problem occurs in a local environment, a fresh installation of wkhtmltopdf should resolve it.
I have a problem with extracting text from .docx files after removing tables.
The docx files I'm dealing with contain a lot of tables that I would like to get rid of before extracting the text.
I first use docx2html to convert a docx file to HTML, and then use BeautifulSoup to remove the table tags and extract the text.
from docx2html import convert
from bs4 import BeautifulSoup
...
temp = convert(FileToConvert)
soup = BeautifulSoup(temp)
for i in range(0, len(soup('table'))):
    soup.table.decompose()
Text = soup.get_text()
While this process works and produces what I need, there is an efficiency issue with docx2html.convert(). Since .docx files are in fact XML-based, would it be possible to skip the procedure of converting docx into HTML and just extract the text from the XML after removing the tables?
.docx files are not just XML files but rather a zipped XML-based format, so you won't be able to pass a docx file directly to BeautifulSoup. The format seems pretty simple though, as the zipped docx contains a file called word/document.xml, which is probably the XML file you want to parse. You can use Python's zipfile module to extract this file and pass its contents directly to BeautifulSoup:
import sys
import zipfile
from bs4 import BeautifulSoup

with zipfile.ZipFile(sys.argv[1], 'r') as zfp:
    with zfp.open('word/document.xml') as fp:
        soup = BeautifulSoup(fp.read(), 'xml')
print(soup)
However, you might also want to look at https://github.com/mikemaccana/python-docx, which might do a lot of what you want already. I haven't tried it so I can't vouch for its suitability for your specific use-case.
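For what it's worth, a minimal sketch of the table-skipping use case with the modern python-docx package (pip install python-docx; its API differs from the older repo linked above, so treat this as an assumption rather than a tested solution):
from docx import Document

# Document.paragraphs yields body-level paragraphs only, so text
# inside tables should be skipped automatically
doc = Document(FileToConvert)
Text = '\n'.join(p.text for p in doc.paragraphs)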