Can't convert HTML to PDF using pdfkit (empty file) - python

I want to convert an HTML file to PDF.
Example Input
The HTML file (shared via Google Drive) looks good; I can see its content in my browser.
Code
import pdfkit # convert html to pdf
pdfkit.from_file('income_contract_01.html', 'res.pdf')
Issue
I expect to see data in res.pdf, but instead I get an essentially empty file: no text, although it has the same number of lines as the original.
Environment and versions used:
python 3.9
pdfkit version 1.0.0
wkhtmltopdf is the newest version (0.12.2.4-1)
OS Ubuntu 16.04
How can I fix this? I don't see any error messages.
Update: I tried specifying the wkhtmltopdf binary path in the configuration, but it didn't help:
import pdfkit # convert html to pdf v1
path = "/usr/bin/wkhtmltopdf"
config = pdfkit.configuration(wkhtmltopdf=path)
pdfkit.from_file('income_contract_01.html', 'res123.pdf', configuration=config)
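Since pdfkit itself reports nothing, one way to debug (a sketch of my own, not from the question) is to run the wkhtmltopdf binary directly and look at what it prints to stderr, which pdfkit may be hiding; the output file name res_debug.pdf is arbitrary:
# Debugging sketch: call wkhtmltopdf directly so its warnings/errors are visible.
import subprocess

result = subprocess.run(
    ["/usr/bin/wkhtmltopdf", "income_contract_01.html", "res_debug.pdf"],
    capture_output=True,
    text=True,
)
print("return code:", result.returncode)
print(result.stderr)  # wkhtmltopdf writes progress and errors to stderr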

I solved it by pointing pdfkit at the wkhtmltopdf executable.
Once you download wkhtmltopdf from here, put the path to wkhtmltopdf.exe into the path_wkhtmltopdf variable in the code below.
import pdfkit
path_wkhtmltopdf = r"C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe"
config = pdfkit.configuration(wkhtmltopdf=path_wkhtmltopdf)
pdfkit.from_file("income_contract.html", "res.pdf", configuration=config)  # use from_file for a local HTML file
Result: res.pdf
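A further hedged note: if you ever upgrade wkhtmltopdf, builds from 0.12.6 onward block access to local files by default, which also produces blank output; the sketch below assumes such a build and passes the corresponding flag through pdfkit's options dict (keys are wkhtmltopdf flags without the leading dashes):
import pdfkit

config = pdfkit.configuration(wkhtmltopdf="/usr/bin/wkhtmltopdf")
options = {"enable-local-file-access": None}  # becomes --enable-local-file-access
pdfkit.from_file("income_contract_01.html", "res.pdf",
                 configuration=config, options=options)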

Related

Python - Download zip files with requests package but get unknown file format

I am using Python 3.8.12. I tried the following code to download files from URLs with the requests package, but got an 'Unknown file format' message when opening the zip file. I tested different zip URLs, but every downloaded zip file is 18KB and none of them can be opened successfully.
import requests
file_url = 'https://www.censtatd.gov.hk/en/EIndexbySubject.html?pcode=D5600091&scode=300&file=D5600091B2022MM11B.zip'
file_download = requests.get(file_url, allow_redirects=True, stream=True)
open(save_path+file_name, 'wb').write(file_download.content)
(Screenshots: the zip opening error message and the downloaded file sizes.)
However, once I updated the URL to file_url = 'https://www.td.gov.hk/datagovhk_tis/mttd-csv/en/table41a_eng.csv', the code worked well and the CSV file could be downloaded perfectly.
I tried the requests, urllib, wget and zipfile/io packages, but none of them worked.
The reason may be that the zip URL directs to both the zip file and a web page, while the csv URL directs to the csv file only.
I am really new to this field, could anyone help on it? Thanks a lot!
You might examine the headers after sending a HEAD request to get information about the file; checking Content-Type reveals the actual type of the file:
import requests
file_url = 'https://www.censtatd.gov.hk/en/EIndexbySubject.html?pcode=D5600091&scode=300&file=D5600091B2022MM11B.zip'
r = requests.head(file_url)
print(r.headers["Content-Type"])
which gives the output:
text/html
So the URL you have actually points to an HTML page, not a zip file.
import wget
url = 'https://www.censtatd.gov.hk/en/EIndexbySubject.html?pcode=D5600091&scode=300&file=D5600091B2022MM11B.zip'
#url = 'https://golang.org/dl/go1.17.3.windows-amd64.zip'
wget.download(url)
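Building on that, here is a small sketch of my own (not part of the original answer): verify the payload really is a zip archive before writing it to disk; the output name data.zip is an arbitrary choice.
import io
import zipfile
import requests

file_url = 'https://www.censtatd.gov.hk/en/EIndexbySubject.html?pcode=D5600091&scode=300&file=D5600091B2022MM11B.zip'
resp = requests.get(file_url, allow_redirects=True)
if zipfile.is_zipfile(io.BytesIO(resp.content)):
    with open("data.zip", "wb") as f:
        f.write(resp.content)
else:
    print("Not a zip archive, Content-Type is:", resp.headers.get("Content-Type", ""))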

How to download Flickr images using photos url (does not contain .jpg, .png, etc.) using Python

I want to download image from Flickr using following type of links using Python:
https://www.flickr.com/photos/66176388@N00/2172469872/
https://www.flickr.com/photos/clairity/798067744/
This data is obtained from xml file given at https://snap.stanford.edu/data/web-flickr.html
Is there any Python script or way to download the images automatically?
Thanks.
I tried to find an answer from other sources and compiled the following:
import re
from urllib import request

def download(url, save_name):
    html = request.urlopen(url).read()
    html = html.decode('utf-8')
    # grab the first large ("_b") JPEG URL embedded in the photo page
    img_url = re.findall(r'https:[^" \\:]*_b\.jpg', html)[0]
    print(img_url)
    with open(save_name, "wb") as fp:
        fp.write(request.urlopen(img_url).read())

download('https://www.flickr.com/photos/clairity/798067744/sizes/l/', 'image.jpg')
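A usage sketch of my own (not part of the compiled answer): reuse download() for several photo-page URLs and skip pages where no large "_b" JPEG is found.
urls = [
    'https://www.flickr.com/photos/clairity/798067744/sizes/l/',
    # ... more photo-page URLs from the Stanford list
]
for i, page_url in enumerate(urls):
    try:
        download(page_url, 'image_{}.jpg'.format(i))
    except IndexError:  # re.findall found no "_b.jpg" match on the page
        print('No large image found on', page_url)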

Python Tika cannot parse pdf from url

I am trying to use Tika in Python for parsing an online PDF for future usage. My code is below.
from tika import parser
import requests
import io

url = 'https://www.whitehouse.gov/wp-content/uploads/2017/12/NSS-Final-12-18-2017-0905.pdf'
response = requests.get(url)
with io.BytesIO(response.content) as open_pdf_file:
    pdfFile = parser.from_file(open_pdf_file)
    print(pdfFile)
However, it shows
AttributeError: '_io.BytesIO' object has no attribute 'decode'
I have taken an example from How can i read a PDF file from inline raw_bytes (not from file)?
In that example PyPDF2 is used, but I need to use Tika, as it gives better results than PyPDF2.
Thank you for helping
In order to use Tika you will need to have Java 8 installed. The code you need to retrieve and print the contents of a PDF is as follows:
from tika import parser
url = 'https://www.whitehouse.gov/wp-content/uploads/2017/12/NSS-Final-12-18-2017-0905.pdf'
pdfFile = parser.from_file(url)
print(pdfFile["content"])
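If you already fetch the PDF with requests, as in the question, a hedged alternative is parser.from_buffer; assuming a tika-python version whose from_buffer accepts the raw response bytes, this avoids the BytesIO/decode problem entirely:
from tika import parser
import requests

url = 'https://www.whitehouse.gov/wp-content/uploads/2017/12/NSS-Final-12-18-2017-0905.pdf'
response = requests.get(url)
parsed = parser.from_buffer(response.content)  # hand Tika the raw bytes directly
print(parsed["content"])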

HTML to PDF conversion in app engine python

My website has a lot of dynamically generated HTML content and I would like to give my users a way to save that data in PDF format. Any ideas on how it can be done? I tried the xhtml2pdf library but couldn't get it to work. I even tried the ReportLab library, but there you have to lay out the PDF contents manually. Is there any library which converts HTML content to PDF and works on App Engine?
You need to copy all dependencies into your GAE project folder:
html5lib
reportlab
six
xhtml2pdf
Then you can use it like this:
from xhtml2pdf import pisa
from cStringIO import StringIO
content = StringIO('html goes here')
output = StringIO()
pisa.log.setLevel('DEBUG')
pdf = pisa.CreatePDF(content, output, encoding='utf-8')
pdf_data = pdf.dest.getvalue()
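The snippet above targets Python 2 (cStringIO). A hedged Python 3 sketch of the same idea, using io instead; the HTML string is just a placeholder:
import io
from xhtml2pdf import pisa

html = "<h1>Invoice</h1><p>html goes here</p>"
output = io.BytesIO()
result = pisa.CreatePDF(html, dest=output, encoding='utf-8')
if not result.err:  # err is non-zero when the conversion failed
    pdf_data = output.getvalue()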
Some useful info that I googled just for you:
http://www.prahladyeri.com/2013/11/how-to-generate-pdf-in-python-for-google-app-engine/
https://github.com/danimajo/pineapple_pdf

web scraping - how to download image into a folder python

I have this code where I would like to download the image and save it into a folder, but all I am getting is the src of the image. I have gone through Stack Overflow, where I found Batch downloading text and images from URL with Python / urllib / beautifulsoup?, but I have no idea how to proceed.
Here is my code; so far I have tried:
elm5=soup.find('div', id="dv-dp-left-content")
img=elm5.find("img")
src = img["src"]
print src
How can I download these images into a folder using the URL?
EDIT: 2021.07.19
Updated from urllib (Python 2) to urllib.request (Python 3)
import urllib.request
f = open('local_file_name','wb')
f.write(urllib.request.urlopen(src).read())
f.close()
src has to be a full path, for example http://hostname.com/folder1/folder2/filename.ext.
If src is /folder1/folder2/filename.ext you have to add http://hostname.com/.
If src is folder2/filename.ext you have to add http://hostname.com/folder1/.
etc.
EDIT: an example of how to download the Stack Overflow logo :)
import urllib.request
f = open('stackoverflow.png','wb')
f.write(urllib.request.urlopen('https://cdn.sstatic.net/Img/unified/sprites.svg?v=fcc0ea44ba27').read())
f.close()
The src attribute contains the image's URL.
You can download it with:
urllib.request.urlretrieve(src, "image.jpg")
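Since the question asks for saving into a folder, here is a small sketch of my own (the folder name "images" and the fallback file name are arbitrary) that combines urlretrieve with os.makedirs:
import os
import urllib.request

def download_image(src, folder="images"):
    os.makedirs(folder, exist_ok=True)
    # derive a file name from the URL, stripping any query string
    filename = os.path.basename(src.split("?")[0]) or "image.jpg"
    path = os.path.join(folder, filename)
    urllib.request.urlretrieve(src, path)
    return path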
