Pure python solution to convert XHTML to PDF - python

I am after a pure Python solution (for the GAE) to convert webpages to pdf.
I had a look at reportlab but the documentation focuses on generating pdfs from scratch, rather than converting from HTML.
What do you recommend? - pisa?
Edit:
My use case is that I have an HTML report that I want to make available as a PDF too. I will be updating the report's structure, so I don't want to maintain a separate PDF version but (hopefully) convert it automatically.
Also because I generate the report HTML I can ensure it is well formed XHTML to make the PDF conversion easier.

Pisa claims to support what I want to do:
pisa is a html2pdf converter using the ReportLab Toolkit, the HTML5lib and pyPdf. It supports HTML 5 and CSS 2.1 (and some of CSS 3). It is completely written in pure Python so it is platform independent. The main benefit of this tool that a user with Web skills like HTML and CSS is able to generate PDF templates very quickly without learning new technologies. Easy integration into Python frameworks like CherryPy, KID Templating, TurboGears, Django, Zope, Plone, Google AppEngine (GAE) etc.
So I will investigate it further.

Have you considered pyPdf? I doubt it has anything like the functional richness you require, but it is a start, and it is pure Python. The PdfFileWriter class is the one that generates PDF output; unfortunately, it requires PageObject instances and provides no real way to put those together other than extracting them from existing PDF documents. All the richer PDF page-generation packages I can find appear to depend on reportlab or other non-pure-Python libraries :-(.

What you're asking for is a pure Python HTML renderer, which is a big task to say the least ('real' renderers like webkit are the product of thousands of hours of work). As far as I'm aware, there aren't any.
Instead of looking for an HTML to PDF converter, what I'd suggest is building your report in a format that's easily converted to both - for example, you could build it as a DOM (a set of linked objects), and write converters for both HTML and PDF output. This is a much more limited problem than converting HTML to PDF, and hence much easier to implement.
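As a rough illustration of that idea, here is a minimal sketch of a report held as a small tree of objects with two independent renderers. All names (Report, Heading, Paragraph, to_html, to_text) are made up for this example, and the plain-text renderer is only a stand-in for a PDF backend; a real one would walk the same tree and call whatever PDF library is available.

```python
from dataclasses import dataclass, field
from html import escape


@dataclass
class Heading:
    text: str


@dataclass
class Paragraph:
    text: str


@dataclass
class Report:
    title: str
    blocks: list = field(default_factory=list)


def to_html(report):
    """Render the report tree as a small XHTML-style fragment."""
    parts = [f"<h1>{escape(report.title)}</h1>"]
    for block in report.blocks:
        tag = "h2" if isinstance(block, Heading) else "p"
        parts.append(f"<{tag}>{escape(block.text)}</{tag}>")
    return "\n".join(parts)


def to_text(report):
    """Stand-in for a PDF backend: walks the same tree, emits plain text."""
    lines = [report.title.upper(), ""]
    for block in report.blocks:
        if isinstance(block, Heading):
            lines += [block.text, "-" * len(block.text)]
        else:
            lines.append(block.text)
    return "\n".join(lines)


# Hypothetical sample data, built once and rendered twice.
report = Report("Sales", [Heading("Q1"), Paragraph("Revenue < forecast")])
html_output = to_html(report)
text_output = to_text(report)
```

Because both renderers consume the same objects, a change to the report structure only has to be made in one place.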

Related

How to convert pdf to xml /json using python code

Can anyone help me convert a PDF file to an XML file using Python code? My PDF contains:
Unstructured data
Images
Mathematical equations
Chemical equations
Table data
Logos, tags, etc.
I tried using PDFMiner, but my PDF data was not converted into .xml/.json format. Are there any libraries other than PDFMiner? PyPDF2, Tabula-py, PDFQuery, Camelot, PyMuPDF, pdf to docx, and pandas are not suitable for my requirement.
Please advise me on any other options. Thank you.
The first thing I would recommend trying is GROBID (see here for the full documentation). You can play with an online demo here to see if it fits your needs (select TEI -> Process Fulltext Document, and upload a PDF). You can also check out this from the Allen Institute (it is based on GROBID and has a handy function for converting TEI.XML to JSON).
The other package which, obviously, does a good job is the Adobe PDF Extract API (see here). It is of course a paid service, but when you register for an account you get 1,000 document transactions for free. It's easy to implement in Python, well documented, and a good way of experimenting and getting a feel for the difficulties of reliable data extraction from PDF.
I worked with both options to extract text, figures, tables etc. from scientific papers. Both yielded good results. The main problem with out-of-the-box solutions is that, when you work with complex formats (or badly formatted docs), erroneously identified document elements are quite common (for example a footnote or a header gets merged with the main text). Both options are based on machine learning models and, at least for GROBID, it is possible to retrain these models for your specific task (I haven't tried this so far, so I don't know how worthwhile it is).
However, if your target PDFs are all of the same (simple) format (or if you can control their format) you should be fine with either option.

Creating PDFs from HTML/Javascript in Python with no OS dependencies

Is there any way to use Python to create PDF documents from HTML/CSS/Javascript, without introducing any OS-level dependencies?
It seems every existing solution requires special supplemental software, but upon reviewing PDF formatting specifications and HTML/CSS/Javascript rendering, there doesn't appear to be a reason why a Python solution can't exist without them. Some solutions come close, such as pyppeteer, but it still leans on a headless Chrome installation locally. These dependencies mean that microservices can't be leveraged, even though PDF generation would otherwise seem to be a viable use case for them.
While similar questions have come up many times over on SO, there doesn't appear to have been a viable technique shown without having to install specialized dependencies on the OS.
Some similar questions which routinely recommend wkhtmltopdf or are otherwise out of date (e.g., moving PDF printing support outside of Chrome is dead now):
How to convert webpage into PDF by using Python
How to convert a local HTML file to PDF using Python in Windows
HTML to PDF conversion using Chrome pdfium
How does Chrome render PDFs from HTML so well?
Convert a HTML/CSS/Javascript file to PDF using Python?
If I've somehow missed a viable approach, please feel free to mark this as a duplicate with my thanks!
Edit February 2021: It appears that the cefpython project may meet these demands - PDF printing support seems like it could be implemented in the near future.
So to clarify and formalize what others have said:
If you want to create PDF documents from HTML/CSS/javascript content, you will necessarily need a javascript engine (because you obviously need to execute the javascript if it affects the visuals of the document). This is the most complex component that you need.
As of now, there is no well-maintained ECMAScript-compliant engine written in pure Python (that would be a huge project). There will probably never be one, since compilers and VMs for languages need to be performant and are thus usually written in a performant low-level language.
So you will always need compiled binaries for the JavaScript engine. The same goes for HTML renderers: they are less complex, but they also need to be performant when used in browsers, so they are usually written in C++ or the like.
The javascript engine and HTML renderer are the major part of a browser, so a headless browser is a good solution to this requirement.
Try this library: xhtml2pdf
It worked for me. Here is the documentation: doc
Some sample code:
from xhtml2pdf import pisa

def convert_html_to_pdf(source_html, output_filename):
    # Open the output file for writing in binary mode (truncates an existing file)
    with open(output_filename, "w+b") as result_file:
        # Convert the HTML to PDF
        pisa_status = pisa.CreatePDF(
            source_html,       # the HTML to convert
            dest=result_file)  # file handle to receive the result
    # pisa_status.err is falsy on success and truthy on errors
    return pisa_status.err

# Define your data
with open("2020-06.html") as html_file:
    source_html = html_file.read()
output_filename = "test.pdf"
convert_html_to_pdf(source_html, output_filename)

How to automatically generate a PDF of a website?

I have a website that has some charts and graphs made using JavaScript libraries. What's a good way to, server-side, auto-generate the HTML, CSS, and JS, and then capture the result in a PDF / PNG / JPG? I'd like to auto-generate reports and email them to my users.
Any programming language is fine, but Ruby / Rails would be best.
I've heard of the wkhtmltopdf project. With the help of the WebKit rendering engine it produces PDFs from a webpage. It offers Python bindings. Ruby bindings are also available: PDFKit
wkhtmltopdf is a good tool to use. I've just used it to generate 500+ PDF documents in one day using a rake task. If you're interested in gems that take advantage of wkhtmltopdf, you can try WickedPDF or PDFKit.
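For readers who would rather call the command-line tool directly than go through a gem or binding, here is a minimal Python sketch. It assumes the wkhtmltopdf binary is installed and on PATH; the function names, the URL in the usage note, and the one-second delay are illustrative choices. The --javascript-delay option is there to give chart scripts time to draw before the page is captured.

```python
import shutil
import subprocess


def build_command(url, output_pdf, js_delay_ms=1000):
    """Build the wkhtmltopdf command line for one page."""
    return [
        "wkhtmltopdf",
        "--javascript-delay", str(js_delay_ms),  # wait for chart scripts to draw
        url,
        output_pdf,
    ]


def render_pdf(url, output_pdf):
    """Run wkhtmltopdf; raises if the binary is missing or exits non-zero."""
    if shutil.which("wkhtmltopdf") is None:
        raise RuntimeError("wkhtmltopdf is not installed or not on PATH")
    subprocess.run(build_command(url, output_pdf), check=True)
```

A call such as render_pdf("http://localhost:3000/report", "report.pdf") would then write the PDF next to the script, assuming the report page is reachable at that (hypothetical) address.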

How do I integrate a browser in my Tk application?

I am writing a Python program and want at some point to load HTML pages from the web and display them, in particular the support forum section/news section of my program. How can I easily integrate that into my program?
It would be easier to switch toolkits - Gtk+/Qt both have support for WebKit rendering engine. Both toolkits also have Python bindings support.
Tkinter has no native support for displaying HTML. With a little bit of work you can download the raw html, do a little text manipulation (assuming it's mostly plain text and a few tags for bold, headers, etc), and display it in a text widget. If it's well formed xhtml you can use an xml parser to pull out the relevant data and give it appropriate formatting.
If the pages have embedded images, or use CSS or JavaScript, the problem becomes exponentially harder to the point of not being practical.
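The text-widget approach for simple pages could look roughly like the sketch below. It uses the standard library's html.parser to turn a handful of tags into (text, style) runs and then inserts them into a Tkinter Text widget with matching tags. The class and function names, the set of supported tags, and the fonts are all assumptions for illustration; anything beyond bold text and headings is ignored.

```python
from html.parser import HTMLParser


class SimpleHTMLRuns(HTMLParser):
    """Collect (text, style) runs for a tiny subset of tags."""

    STYLES = {"b": "bold", "strong": "bold", "h1": "heading", "h2": "heading"}

    def __init__(self):
        super().__init__()
        self.runs = []
        self._stack = []

    def handle_starttag(self, tag, attrs):
        if tag in self.STYLES:
            self._stack.append(self.STYLES[tag])

    def handle_endtag(self, tag):
        if tag in self.STYLES and self._stack:
            self._stack.pop()
        if tag in ("p", "h1", "h2"):
            self.runs.append(("\n", "normal"))  # block elements end the line

    def handle_data(self, data):
        style = self._stack[-1] if self._stack else "normal"
        self.runs.append((data, style))


def parse_runs(html):
    """Return the list of (text, style) runs found in the HTML string."""
    parser = SimpleHTMLRuns()
    parser.feed(html)
    parser.close()
    return parser.runs


def show(html):
    """Display the parsed runs in a Tkinter Text widget (needs a display)."""
    import tkinter as tk

    root = tk.Tk()
    text = tk.Text(root, wrap="word")
    text.tag_configure("normal", font=("Helvetica", 10))
    text.tag_configure("bold", font=("Helvetica", 10, "bold"))
    text.tag_configure("heading", font=("Helvetica", 14, "bold"))
    for chunk, style in parse_runs(html):
        text.insert("end", chunk, style)
    text.pack(fill="both", expand=True)
    root.mainloop()
```

Calling show("<h1>News</h1><p>Hello <b>world</b></p>") opens a window with a heading line followed by a paragraph with one bold word.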

What's a good document standard to use programmatically?

I'm writing a program that requires input in the form of a document, it needs to replace a few values, insert a table, and convert it to PDF. It's written in Python + Qt (PyQt). Is there any well known document standard which can be easily used programmatically? It must be cross platform, and preferably open.
I have looked into Microsoft Doc and Docx, which are binary formats and I can't edit them. Python has bindings for it, but they're only on Windows.
Open Office's ODT/ODF is XML packed in a zip archive, so I can edit that one, but there are no command-line utilities or any other way to programmatically convert the file to a PDF. Open Office provides bindings, but you need to run Open Office from the command line, start a server, etc. And my clients may not have Open Office installed.
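For reference, the editing half of that can be sketched with the standard zipfile module alone. The function name and the {{NAME}}-style placeholder are hypothetical, and the sketch assumes each placeholder appears as one unbroken string in content.xml, which a word processor does not guarantee. It does not address the PDF conversion step.

```python
import zipfile


def fill_odt(template_path, output_path, replacements):
    """Copy an .odt archive, substituting placeholders in content.xml."""
    with zipfile.ZipFile(template_path) as src, \
         zipfile.ZipFile(output_path, "w") as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "content.xml":
                text = data.decode("utf-8")
                # Values are inserted verbatim; XML-escape them first if
                # they may contain <, > or &.
                for placeholder, value in replacements.items():
                    text = text.replace(placeholder, value)
                data = text.encode("utf-8")
            # Reusing the original ZipInfo keeps each entry's position and
            # compression method (the "mimetype" entry stays first).
            dst.writestr(item, data)
```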
RTF is readable from Python, but I couldn't find any way/libraries to convert RTF documents to PDF.
At the moment I'm exporting from Microsoft Word to HTML, replacing the values and using PyQt to convert it to a PDF. However it loses formatting features and looks awful. I'm surprised there isn't a well known library which lets you edit a variety of document formats and convert them into other formats, am I missing something?
Update: Thanks for the advice, I'll have a look at using Latex.
Thanks,
Jackson
Have you looked into using LaTeX documents?
They are perfect to use programmatically (compiling documents? You gotta love that...), and you have several Python frameworks you can use such as plasTeX and PyTex.
Exporting a LaTeX document to PDF is almost immediate.
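Even without a framework, the replace-values-and-insert-a-table workflow can be sketched with the standard library. The template, the function names, and the escaping rules below are illustrative assumptions; the escaping covers only a few of LaTeX's special characters, and the compile step assumes a TeX distribution that provides pdflatex on PATH.

```python
import shutil
import subprocess
from string import Template

# Hypothetical template; $name and $rows are the slots to fill in.
TEX_TEMPLATE = Template(r"""\documentclass{article}
\begin{document}
Dear $name,

\begin{tabular}{lr}
$rows
\end{tabular}
\end{document}
""")


def tex_escape(value):
    """Escape a few of the LaTeX special characters likely to appear in data."""
    for char in "&%$#_":
        value = value.replace(char, "\\" + char)
    return value


def render_tex(name, table):
    """Return LaTeX source with the name and table rows filled in."""
    rows = "\n".join(
        f"{tex_escape(label)} & {tex_escape(amount)} \\\\" for label, amount in table
    )
    return TEX_TEMPLATE.substitute(name=tex_escape(name), rows=rows)


def compile_pdf(tex_path):
    """Run pdflatex on tex_path; assumes a TeX distribution is installed."""
    if shutil.which("pdflatex") is None:
        raise RuntimeError("pdflatex not found on PATH")
    subprocess.run(["pdflatex", "-interaction=nonstopmode", tex_path], check=True)
```

Writing the result of render_tex to a .tex file and passing its path to compile_pdf produces the PDF in the current working directory.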
Since you're already using PyQt anyway, it might be worth looking at Qt's built-in rich text processing module, which looks decent. Here's the documentation on detailed content manipulation, including inserting tables. Also, the QPrinter module's default print-to-file format happens to be PDF.
Without knowing more about your particular needs it's hard to say if these would do what you want, but since your application already has PyQt as a dependency, it seems silly to introduce any more without evaluating the functionality you've already got available.
The non-GUI parts of the Qt framework are often overlooked though.
edit: included more links.
You might want to try ReportLab. The open source version can write PDFs, and the commercial version has a lot of really nice abstractions to allow output to a variety of different formats from a single input.
I don't know what kind of audience your program has. TeX is good and I would go with it.
Another possible choice is the Excel format, parsing it with xlrd.
I've used it a couple of times and it's pretty straightforward.
An Excel file is a good choice for the following reasons:
It is a well-known format that is easy to edit
You could prepare a predefined template with constraints and a table
Another option is creating XML documents, transforming them to XSL-FO and rendering with FOP or RenderX. If you use DocBook as the primary input, there are toolchains freely available for converting that to PDF, RTF, HTML and so forth.
It is rather quirky to use and not my idea of fun, but it does deliver and can be embedded in an application, AFAICT.
Creating docbook is very straightforward as it has a wide range of semantic tags, table support etc to give a "meaningful" markup which can be reliably formatted. The XSL stylesheets are modular and allow parts to be customized or replaced to generate your own look and feel.
It works well for relatively free flow documents with lots of text.
For fill-in-the-blanks kinds of documents, a regular reporting engine may be a better fit, or some straightforward XSL stylesheets spitting out the XSL-FO directly.
